CN114973219A - Text content extraction method and device

Text content extraction method and device

Info

Publication number
CN114973219A
Authority
CN
China
Prior art keywords
image
similarity
segments
character
features
Legal status
Pending
Application number
CN202110212571.1A
Other languages
Chinese (zh)
Inventor
宋思博
杨志博
王永攀
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN202110212571.1A
Publication of CN114973219A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a text content extraction method and apparatus. According to the embodiments, after the image features of the text segments in at least two images are extracted, similarity calculation is performed on the similar parts of those text segments to determine the similarity between their image features, and the text segments are merged according to the similarity to obtain the text content corresponding to the at least two images. Because the similarity between the image features is determined by computing similarity over the similar parts of the text segments, that is, the text segments in the at least two images are dynamically matched and aligned so that their similar parts correspond before similarity is computed, the resulting similarity characterizes the resemblance between text segments more accurately than a scheme that computes similarity on the text segments directly, and the text segments in the images can be better associated without annotating sequence features between them.

Description

Text content extraction method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular to a text content extraction method and apparatus, a processing method and apparatus for an image information parsing network, an image retrieval method and apparatus, a course content extraction method and apparatus, a subtitle extraction method and apparatus, and corresponding electronic devices and machine-readable media.
Background
With the rapid adoption of consumer electronic products such as mobile phones and tablet computers, and the rapid development of social media and video platforms such as long video, short video and live streaming, more and more users perform daily activities such as learning, entertainment and news browsing through video media.
Videos usually contain a large amount of text information in various forms, such as post-production subtitles and scene text, and the association and extraction of video text is of great significance in different scenarios. Without text association technology, or with a poorly performing one, the extracted video content is poorly related in content and time, and the text information provided to the client is scattered and disordered; the quality of the association therefore directly affects the user's experience with the video product.
With the diversification of video application scenarios, the text in video frames varies widely in form and content between frames, and existing text-segment association schemes based on the image features of text segments perform poorly at text-segment association and video content extraction in scenarios where the form and content of text segments change significantly.
Disclosure of Invention
In view of the above, the present application provides a text content extraction method and apparatus, a processing method for an image information parsing network, an image retrieval method, a course content extraction method, a subtitle extraction method, and a computer device and machine-readable medium that overcome, or at least partially solve, the above problems.
According to an aspect of the present application, there is provided a text content extraction method, including:
extracting image features of text segments in at least two images;
determining a similarity between the image features of the text segments in the at least two images by performing similarity calculation on similar parts of the text segments in the at least two images;
and merging the text segments in the at least two images according to the similarity to obtain text content corresponding to the at least two images.
According to another aspect of the present application, there is provided a processing method of an image information parsing network, including:
acquiring an associated image sample pair;
extracting the image features of the text segments in the associated image samples separately;
determining a similarity between the image features of the text segments of the associated image sample pair by performing similarity calculation on similar parts of those text segments;
and training, according to the similarity determined for the associated image sample pair, an image information parsing network for extracting image features of text segments in images.
According to another aspect of the present application, there is provided a processing method of an image information parsing network, including:
obtaining a ternary sample group comprising an associated image sample pair and a non-associated image sample pair;
extracting the image features of the text segments in the ternary sample group separately;
determining a similarity between the image features of the text segments of the associated image sample pair by performing similarity calculation on similar parts of those text segments;
determining a similarity between the image features of the text segments of the non-associated image sample pair by performing similarity calculation on similar parts of those text segments;
and training an image information parsing network according to the similarity determined for the associated image sample pair and the similarity determined for the non-associated image sample pair, the image information parsing network being used for extracting image features of text segments in images.
According to another aspect of the present application, there is provided an image retrieval method including:
acquiring a first image serving as the retrieval basis;
extracting image features of text segments in the first image;
determining a similarity between the image features of the text segments in the first image and in a second image by performing similarity calculation on similar parts of the text segments respectively comprised in the first image and the second image;
determining, according to the similarity, that the second image is an associated image of the first image;
and providing the associated image as an image retrieval result.
According to another aspect of the present application, there is provided a method for extracting course content, including:
extracting image features of text segments in image frames of a course video;
determining a similarity between the image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar parts of the text segments in the two adjacent image frames;
and merging the text segments in two adjacent image frames whose similarity falls within a set range, to obtain the text content corresponding to the course video.
According to another aspect of the present application, there is provided a subtitle extraction method including:
identifying image features of subtitle segments from image frames of a target video respectively;
determining a similarity between the image features of the subtitle segments corresponding to two adjacent image frames by performing similarity calculation on similar parts of the subtitle segments in the two adjacent image frames;
and merging the subtitle segments in two adjacent image frames whose similarity falls within a set range, to obtain the subtitle content corresponding to the target video.
In accordance with another aspect of the present application, there is provided a video content extraction method including:
extracting image features of text segments in image frames of a target video;
determining a similarity between the image features of the text segments in two adjacent image frames by performing similarity calculation on similar parts of the text segments in the two adjacent image frames;
and merging the text segments in the two adjacent image frames according to the similarity to obtain the text content corresponding to the target video.
According to another aspect of the present application, there is provided a conference content processing method, including:
acquiring a conference video in real time;
extracting image features of text segments in image frames of the conference video;
determining a similarity between the image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar parts of the text segments in the two adjacent image frames;
merging the text segments in two adjacent image frames whose similarity falls within a set range;
adding the text content obtained after the merging to the conference video as subtitles;
and providing the conference video with the added subtitles.
In accordance with another aspect of the present application, there is provided a remote video processing method including:
identifying image features of text segments in image frames of a telemedicine video;
determining a similarity between the image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar parts of the text segments in the two adjacent image frames;
merging the text segments in two adjacent image frames whose similarity falls within a set range;
updating the text segments in the telemedicine video frames according to the text content obtained after the merging;
and providing the updated telemedicine video.
According to another aspect of the present application, there is provided a text content extraction method, including:
acquiring at least two submitted images;
extracting image features of the text segments in the at least two images;
determining a similarity between the image features of the text segments in the at least two images by performing similarity calculation on similar parts of the text segments in the at least two images;
merging the text segments in the at least two images according to the similarity to obtain text content corresponding to the at least two images;
and providing the text content.
In accordance with another aspect of the present application, there is provided an electronic device including: a processor; and
a memory having executable code stored thereon which, when executed, causes the processor to perform any one of the methods above.
According to another aspect of the application, there are provided one or more machine-readable media having executable code stored thereon which, when executed, causes a processor to perform any one of the methods above.
According to the embodiments of the present application, after the image features of the text segments in at least two images are extracted, similarity calculation is performed on the similar parts of those text segments to determine the similarity between their image features, and the text segments are then merged according to the similarity to obtain the text content corresponding to the at least two images. Because the similarity between the image features is determined by computing similarity over the similar parts of the text segments, that is, the text segments in the at least two images are dynamically matched and aligned so that their similar parts correspond before similarity is computed, the resulting similarity characterizes the resemblance between text segments more accurately than a scheme that computes similarity on the text segments directly, and no sequence features between text segments need to be annotated. The text segments in the images can therefore be better associated, the video content extraction result is concise and complete, the requirements on output content are met, and the user experience is improved. Moreover, the scheme adapts well to the partially overlapping sequence characteristics between the text segments of video frames in video scenes, and has strong generality.
The foregoing is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be understood more clearly and implemented according to the content of the specification, and in order to make the above and other objects, features and advantages of the present application more apparent, specific embodiments of the present application are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 shows an example of a video content extraction method of the present application;
fig. 2 is a flowchart illustrating a text content extraction method according to a first embodiment of the present application;
FIG. 3 is a flowchart of a processing method of an image information parsing network according to the second embodiment of the present application;
FIG. 4 is a flowchart of a processing method of an image information parsing network according to a third embodiment of the present application;
FIG. 5 is a flowchart of an image retrieval method according to the fourth embodiment of the present application;
fig. 6 is a flowchart illustrating a method for extracting curriculum contents according to a fifth embodiment of the present application;
fig. 7 shows a flowchart of a subtitle extraction method according to a sixth embodiment of the present application;
fig. 8 shows a flowchart of a video content extraction method according to a seventh embodiment of the present application;
fig. 9 shows a flowchart of a conference content processing method according to an eighth embodiment of the present application;
FIG. 10 is a flowchart of a remote video processing method according to a ninth embodiment of the present application;
fig. 11 is a flowchart illustrating a text content extraction method according to a tenth embodiment of the present application;
fig. 12 is a block diagram showing a configuration of a text content extraction apparatus according to an eleventh embodiment of the present application;
fig. 13 is a block diagram illustrating a processing apparatus of an image information parsing network according to a twelfth embodiment of the present application;
fig. 14 is a block diagram showing a processing apparatus of an image information parsing network according to a thirteenth embodiment of the present application;
fig. 15 is a block diagram showing a configuration of an image retrieval apparatus according to a fourteenth embodiment of the present application;
fig. 16 is a block diagram showing a configuration of a lesson content extracting apparatus according to a fifteenth embodiment of the present application;
fig. 17 is a block diagram showing a configuration of a subtitle extracting apparatus according to a sixteenth embodiment of the present application;
fig. 18 is a block diagram showing a configuration of a video content extraction apparatus according to a seventeenth embodiment of the present application;
fig. 19 is a block diagram showing a configuration of a conference content processing apparatus according to an eighteenth embodiment of the present application;
fig. 20 is a block diagram showing a configuration of a remote video processing apparatus according to a nineteenth embodiment of the present application;
fig. 21 is a block diagram showing a configuration of a text content extraction apparatus according to a twentieth embodiment of the present application;
FIG. 22 shows a schematic diagram of terminal interaction with a cloud service platform;
fig. 23 illustrates an exemplary system that can be used to implement various embodiments described in this disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Existing video text association algorithms can be broadly divided into association algorithms based on the positional relationship of text in adjacent video frames, association algorithms based on the similarity of text-line image features, and fusions of the two.
The text association algorithm based on the positional relationship of text is a scheme based on the area Intersection over Union (IoU): it judges whether the text of a previous frame and a following frame is associated by computing whether the ratio of the overlapping area of two text lines in the current frame and the previous frame to their total covered area exceeds a certain threshold. Because the characteristics of the text content itself are not considered, recognition of text with display effects is poor.
Compared with the position-based text association algorithm, methods that extract pre-designed display characteristics of a text line, such as brightness and background, on the basis of traditional image features adapt better to scenes with text display effects, but they are sensitive to illumination changes, text occlusion, similar backgrounds and the like, and therefore generalize poorly across scenes.
In particular, in scenes where the form and content of text segments change significantly, text-segment association and video content extraction perform poorly.
In view of this, the embodiments of the present application provide a new text content extraction scheme, together with corresponding application schemes for the processing of an image information parsing network, image retrieval, course content extraction and subtitle extraction.
In the text content extraction scheme of the embodiments, the similarity of text segments is judged on the basis of the image features of the text segments in the images, similar text segments are determined to be associated text segments, and the text content of the images is assembled according to the associated text segments.
A text segment may be a text line, i.e., a plurality of characters arranged in one line, or may be a plurality of characters in some other form.
The scheme of the embodiments can be applied to text content extraction across two or more images; the following description takes two images as an example.
The similarity between the image features of the text segments is determined by performing similarity calculation on the similar parts of the text segments. That is, when computing the similarity of the image features, the embodiments distinguish the similar parts from the non-similar parts of the text segments of the two images: the similar parts are compared with each other, and the remaining parts with each other. The similarity results of the different parts can then be combined, for example by summation or weighted averaging, into the similarity between the image features of the text segments. To find the similar parts, the image feature of a text segment can be divided along the width of the feature space into a plurality of sub-features, of equal or different lengths; the sub-features of the two images are compared, and the similar part may be the part whose similarity falls within a certain numerical range, or the part whose similarity is higher than that of the other parts.
The feature space of an image feature generally has three attributes: width, height and feature channels. When dividing along the feature-space width, the number of divisions and the length of a single sub-feature can be determined according to actual requirements.
For example, suppose the text segments extracted from the two images are "happy new" and "new year", and the image region of each segment is divided so that the image features corresponding to every three characters form one sub-feature, i.e., "happy new" is divided into "hap", "py" and "new", and "new year" is divided into "new", "ye" and "ar", where "new" is the similar part of the two images. Under the prior-art scheme, similarity would be computed between the three pairs of sub-features by position, i.e., between "hap" and "new", "py" and "ye", and "new" and "ar". Under the scheme of the present application, the two "new" sub-features are identified as the similar parts and compared with each other, while the remaining sub-features can be paired randomly or by corresponding position for similarity calculation.
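As a rough illustration of this division, the following Python sketch (the NumPy representation and all names are assumptions for illustration, not taken from the patent) splits a [w, k] feature matrix into fixed-width sub-features along the feature-space width:

```python
import numpy as np

def split_subfeatures(feat: np.ndarray, group_width: int) -> list[np.ndarray]:
    """Divide a [w, k] text-segment feature along its width into sub-features.

    Each sub-feature covers group_width unit widths (e.g. the feature columns
    of three characters), mirroring the "hap" / "py" / "new" division above.
    """
    w = feat.shape[0]
    return [feat[i:i + group_width] for i in range(0, w, group_width)]

# e.g. a 9-unit-wide feature with 128 channels -> three 3-wide sub-features
subs = split_subfeatures(np.random.rand(9, 128), 3)
print([s.shape for s in subs])  # [(3, 128), (3, 128), (3, 128)]
```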
According to the embodiments of the present application, after the image features of the text segments in at least two images are extracted, similarity calculation is performed on the similar parts of those text segments to determine the similarity between their image features, and the text segments are then merged according to the similarity to obtain the text content corresponding to the at least two images. Because the similarity between the image features is determined by computing similarity over the similar parts of the text segments, that is, the text segments in the at least two images are dynamically matched and aligned so that their similar parts correspond before similarity is computed, the resulting similarity characterizes the resemblance between text segments more accurately than a scheme that computes similarity on the text segments directly, and no sequence features between text segments need to be annotated. The text segments in the images can therefore be better associated, the video content extraction result is concise and complete, the requirements on output content are met, and the user experience is improved. Moreover, the scheme adapts well to the partially overlapping sequence characteristics between the text segments of video frames in video scenes, and has strong generality.
If the similarity meets a set condition, the text segments have the same or similar content and are associated text segments; the text segments in the two images can then be merged to obtain the text content corresponding to the two images. The set condition may be that the similarity exceeds a set threshold, or that the text segments of the two images whose similarity ranks first are determined to be associated text segments.
The merging here may be deduplication, or completion of repeated text segments. Where subtitles in a video have overlapping parts in the two images, deduplication can be performed: for example, if the two images correspond to "happy new year" and "happy new year to you all" respectively, deduplication yields "happy new year to you all". Where the text segment in a single image is not completely displayed, completion can be performed, i.e., the partial text is integrated over time: for example, if the two images correspond to "happy new" and "year to you all", completion yields "happy new year to you all". The two can also be applied together: for example, if the two images correspond to "happy new" and "new year", deduplication and completion yield "happy new year".
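A minimal string-level sketch of the two merging operations follows; it is an illustrative simplification that assumes the frames arrive in order and operates on recognized strings, whereas the embodiments decide association from image features:

```python
def merge_segments(a: str, b: str) -> str:
    """Merge two associated text segments by deduplication and completion."""
    if a in b:                       # full overlap: deduplicate, keep the longer
        return b
    if b in a:
        return a
    # completion: longest suffix of a that is a prefix of b
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return a + b[n:]
    return a + " " + b               # no overlap found: simple concatenation

print(merge_segments("happy new", "new year"))                        # happy new year
print(merge_segments("happy new year", "happy new year to you all"))  # dedup case
```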
By filtering repeated text content and temporally integrating partial text as above, the generated video content is concise and complete, meets the requirements on output content, and greatly improves the user experience.
In an optional embodiment, the text content obtained by the merging may be provided on a device interface together with an associated editing control, so that a device user (e.g., a video subtitle producer) can edit the text content through the control; updated text content is obtained from the editing operation and displayed on the device interface. Alternatively, the text content can be confirmed and the confirmed result displayed on the device interface, or the text segments in the image can be updated accordingly.
Image feature extraction is a concept in computer vision and image processing, and refers to using a computer to extract information of one or more set dimensions from an image. In the embodiments of the present application, the image feature of a text segment is the image feature of the image region where the text segment is located, and may include image features presented by the content of the text segment itself, for example color features, texture features, shape features and spatial-relationship features. Color and texture features describe surface properties of the scene corresponding to the image or image region; a shape feature may be a contour feature of an object's outer boundary, or a region feature describing the enclosed region; a spatial-relationship feature refers to the spatial positional relationships among multiple objects in the image.
Image features can represent an image or a partial region of an image, and can therefore be used in place of the image or region itself in various comparisons or operations. For example, for text segments in at least two images, if the similarity of the image features corresponding to the text segments meets a certain condition, the two text segments are similar.
The types of features extracted can be distributed across multiple dimensions and are not limited by the present application. For example, they may include low-level features such as brightness and color histograms, which are robust to text lines with angle and scale changes in video, as well as high-level features such as the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF), which remain stable under changes in illumination, translation and rotation. To accommodate image blur caused by fast motion of the camera or scene, constraints such as filtering of feature mismatches and geometric registration via the Random Sample Consensus algorithm (RANSAC) can additionally be applied.
In an optional embodiment, when extracting the image features of text segments in an image frame of a target video, the image region where a text segment is located may be divided into a plurality of character regions; the image features of the character regions are extracted separately and combined into the image feature of the text segment. Specifically, the extracted image feature may be represented as a matrix [w, k], where w is the spatial width of the feature and k is the number of channels, i.e., the number of dimensions, of the image feature.
The image region may be divided according to a preset number of character regions; for example, if the image region is divided into 4 character regions, the spatial width of the image feature is 4 unit widths. Specifically, the image features at different positions within a single character region can be aggregated by a pooling operation into the image feature of that character region, and normalization can bring the output image features onto the same characterization scale.
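One plausible realization of this pooling and normalization is sketched below, assuming NumPy feature maps produced by an upstream network (the shapes, names and choice of average pooling are assumptions of this illustration):

```python
import numpy as np

def segment_features(region_maps: list[np.ndarray]) -> np.ndarray:
    """Aggregate per-character-region feature maps into a [w, k] matrix.

    region_maps: one [h, w', k] feature map per character region (w regions
    in total). Average pooling collapses each region to a k-dim vector, and
    L2 normalization puts every sub-feature on the same characterization scale.
    """
    pooled = [m.mean(axis=(0, 1)) for m in region_maps]   # one [k] vector per region
    feats = np.stack(pooled)                              # [w, k]
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.clip(norms, 1e-12, None)

# e.g. 4 character regions of shape [8, 6, 128] -> a [4, 128] segment feature
print(segment_features([np.random.rand(8, 6, 128) for _ in range(4)]).shape)
```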
The similarity calculation may be implemented by any similarity algorithm; for example, the Euclidean distance, cosine distance or Manhattan distance (L1-norm distance) between image features may be computed and used as the similarity between the image features.
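For instance, a cosine-distance measure between two pooled feature vectors might look as follows (a trivial sketch; names are assumptions):

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity; a smaller value means the features are more alike."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```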
In an optional embodiment, when determining the similarity between the image features of the text segments in the at least two images by performing similarity calculation on their similar parts, the similarities between sub-features can first be determined from the plurality of sub-features into which each text segment's feature is divided along the feature-space width, i.e., pairwise similarity is computed between all the sub-features of the two images. The required correspondences are then selected from all the similarities computed above: the sub-features containing similar parts are associated according to the similarities between sub-features, for example by magnitude or rank, and two sub-features whose similarity is satisfactory are associated as sub-features containing a similar part.
The similarity between the image features of the text segments is then determined according to the correspondence of the associated sub-features. For example, the similarities already computed for the sub-features identified as containing similar parts are taken, similarities are computed for the remaining sub-features, and the per-sub-feature similarities are summed or weighted to obtain the similarity of the two text segments.
In an optional embodiment, the sub-features containing similar parts can be associated by a minimum-path method: a similarity matrix is constructed from the similarities between all the sub-features of the two text segments, the minimum-cost path through the similarity matrix is determined, and the sub-feature correspondence represented by that path is taken as the correspondence of the associated sub-features. Since the similarity is expressed as a distance between image features, the sum of the similarities along the minimum-cost path is minimal, i.e., taken together, the distances between the image features of all the sub-features are minimal, and the two sub-features behind each similarity on the minimum-cost path are the most similar pairing among all the sub-features.
The minimum-cost path can be determined by dynamic programming, a greedy algorithm, Dijkstra's algorithm (the single-source shortest-path algorithm), the Bellman-Ford algorithm and the like. Taking dynamic programming as an example, based on the interdependence of optimal substructures, all possible solutions are traversed by enumeration and an optimal value is computed at each path subnode, so that the optimal solution is found.
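A minimal dynamic-programming sketch of such a minimum-cost path over the similarity matrix, in the spirit of dynamic time warping (Euclidean distance serves as the dissimilarity; the names and the exact move set are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def min_cost_alignment(fa: np.ndarray, fb: np.ndarray):
    """Associate the sub-features of two text segments via a minimum-cost path.

    fa, fb: [w, k] sub-feature matrices. cost[i, j] is the distance between
    sub-feature i of segment a and sub-feature j of segment b; the monotone
    path from (0, 0) to (wa-1, wb-1) with the smallest accumulated cost gives
    the correspondence of associated sub-features.
    """
    cost = np.linalg.norm(fa[:, None, :] - fb[None, :, :], axis=-1)  # [wa, wb]
    wa, wb = cost.shape
    acc = np.full((wa, wb), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(wa):
        for j in range(wb):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    # backtrack to recover which sub-feature pairs lie on the path
    i, j, path = wa - 1, wb - 1, [(wa - 1, wb - 1)]
    while i > 0 or j > 0:
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            steps.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    return acc[-1, -1], path[::-1]

total, pairs = min_cost_alignment(np.random.rand(4, 128), np.random.rand(4, 128))
print(total, pairs)  # accumulated distance and the associated sub-feature pairs
```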
Further, when determining the similarity between the image features of the text segments according to the correspondence of the associated sub-features, since the similarities on the minimum-cost path are those of the most similar sub-features, the similarity between the image features of the text segments can be determined according to the similarities of the sub-features on the minimum-cost path. As discussed above, compared with a scheme that computes similarity on the text segments directly, the similarity thus obtained characterizes the resemblance between text segments more accurately.
In an alternative embodiment, the image features of the text segments in the image frames may be extracted by an image information parsing network, which is trained in advance on collected associated image sample pairs.
An associated image sample pair comprises two image samples whose text segments are associated, and a plurality of such pairs are collected. The image information parsing network is trained according to the similarity between the image features of the text segments of the associated image sample pairs, so that the similarity computed from the image features the network extracts for an associated image sample pair conforms to the similarity characteristic of that pair.
Compared with existing schemes that extract pre-designed image features, the embodiments can train the image information parsing network with a deep learning algorithm, so that it adaptively learns image features under projection transformation, illumination and occlusion, shake and blur, and the like; it can accommodate large changes in the form and content of text segments over the course of a video, and associates the text segments of video frames more accurately.
The image information parsing network may be a Convolutional Neural Network (CNN), or another self-learning neural network such as a Recurrent Neural Network (RNN).
Before the above scheme is applied, the image information parsing network can be trained in advance; specifically, associated image sample pairs can be acquired, and the image information parsing network for extracting the image features of text segments in images can be trained according to the associated image sample pairs.
In an optional embodiment, the image features of the text segments in the associated image samples can be extracted separately, similarity calculation can be performed on the similar parts of those text segments to determine the similarity between their image features, and the image information parsing network can then be trained according to the similarity determined for the associated image sample pairs, for extracting the image features of text segments in images.
In an alternative embodiment, when acquiring the associated image sample pairs, text segments (i.e., a sequence of a text segment across multiple frames) may be extracted from the image frames of video samples, and the image frames whose text segments contain corresponding similar parts are then used as associated image sample pairs, i.e., positive sample pairs.
Furthermore, image frames whose text segments contain no similar parts can be used as non-associated image sample pairs, i.e., negative sample pairs. Combined with the positive sample pairs, ternary sample groups containing positive and negative pairs can be constructed; for example, a ternary sample group contains three samples A, B and C, where A and B form a positive sample pair, and B and C, or A and C, form a negative sample pair.
For example, the text segments "ricomer", "brmer" and "ALL" are extracted from three consecutive image frames of a video. After image features are extracted and similarity computed, it is determined that the first two text segments contain the similar content "r" and "mer", while neither of them shares content with the last segment; the first two image frames can then form an associated image sample pair, and each of them forms a non-associated image sample pair with the third image frame.
The similarity between the image features of the text segments of a non-associated image sample pair is likewise determined by performing similarity calculation on the similar parts of those text segments. When the image information parsing network is trained on the positive and negative sample pairs, i.e., according to the similarity determined for the associated image sample pairs and the similarity determined for the non-associated image sample pairs, the training simultaneously drives the similarity of positive pairs higher and that of negative pairs lower, yielding an image information parsing network with more accurate characterization.
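Purely for illustration, the following sketch mines ternary sample groups from per-frame recognized text, with shared substrings standing in for the feature-level similarity the embodiments actually use (the function names and the overlap test are assumptions):

```python
def has_overlap(a: str, b: str, min_len: int = 1) -> bool:
    """True if the segments share a common substring of at least min_len chars."""
    return any(a[i:i + min_len] in b for i in range(len(a) - min_len + 1))

def build_triplets(frames: list[str]) -> list[tuple[str, str, str]]:
    """Build (anchor, positive, negative) groups from per-frame text segments.

    Frames whose segments share content form associated (positive) pairs;
    a frame sharing nothing with the anchor supplies the negative sample.
    """
    triplets = []
    for i, anchor in enumerate(frames):
        for j, other in enumerate(frames):
            if j == i or not has_overlap(anchor, other):
                continue
            negatives = [f for f in frames if not has_overlap(anchor, f)]
            if negatives:
                triplets.append((anchor, other, negatives[0]))
    return triplets

# the example from the text: the first two frames share "r" and "mer"
print(build_triplets(["ricomer", "brmer", "ALL"]))
```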
In an alternative embodiment, after the image information parsing network outputs the image features, the dynamic matching and alignment scheme described above can again be adopted when determining the similarity between the image features of the text segments in an associated image sample pair: the similarity between the image features of the text segments in two adjacent image frames is determined by performing similarity calculation on the similar parts of those text segments.
Correspondingly, when determining the similarity between the image features of the text segments of an associated image sample pair by performing similarity calculation on their similar parts, the similarities between sub-features can be determined from the plurality of sub-features comprised in each text segment, the sub-features containing similar parts can be associated according to the similarities between the sub-features, and the similarity between the image features of the text segments can then be determined according to the correspondence of the associated sub-features.
In an optional embodiment, when associating the sub-features containing similar parts according to the similarities between the sub-features, a similarity matrix can be constructed from the similarities between all the sub-features of the text segments of the associated image sample pair; the minimum-cost path of the similarity matrix is found by dynamic programming, and the sub-feature correspondence represented by the minimum-cost path is taken as the correspondence of the associated sub-features.
Correspondingly, when determining the similarity between the image features of the text segments of the associated image sample pair according to the correspondence of the associated sub-features, the similarity can be determined according to the similarities of the sub-features on the minimum-cost path.
It should be noted that the present application may be implemented as an application, a service, an instance or a functional module in software form, as a virtual machine (VM) or container, or as a hardware device (such as a server or terminal device) or a hardware chip (such as a CPU, GPU or FPGA) with image processing capability.
A cloud service platform can use its own computing resources to provide the corresponding video or image services. An end user (such as a subtitle/video/image editing professional or an ordinary user) can obtain the video or image processing service through a client or a configured interface and, after receiving the platform's processing result for the text segments in the video or images, can further perform any personalized editing needed.
Alternatively, the cloud service platform can use its own computing resources to provide a training service for the image information parsing network. A video platform can request training of the network through a client or a configured interface, and can also submit video samples from which ternary sample groups are extracted; the cloud service platform trains the image information parsing network on the ternary sample groups by the method described above and provides it to the video platform, which then uses the network in subsequent applications such as video content extraction, course content extraction and subtitle extraction.
The embodiments of the present application can be applied to extracting video content in a variety of scenarios, for example processing the text segments in educational live streams, entertainment live streams or remote course videos, news videos, street-view videos, product videos and entertainment videos: animated text in variety shows, scrolling subtitles in news programs, signboard information in street-view videos, product information in e-commerce live streams, the package text of cylindrical goods such as bottles and cans, and the like. Concise and complete information can thus be provided, convenient to view or to use in subsequent production.
Video content extraction can also be applied to processing in video conference scenarios, so that users participating in a conference can conveniently and promptly obtain the key information in the conference video during the conference or review it afterwards. Applied to remote video scenarios, the extraction associates the text segments of multiple video frames, so that users participating in a telemedicine video can conveniently and promptly obtain key medical information during the video or review it afterwards.
Besides extracting video content, the association between the text segments of video frames can also be used to determine when a given piece of text information enters and leaves the video, so that its occurrence is marked on the time axis and the text information can later be located by time period.
Referring to fig. 1, an example of the text content extraction method of the present application is shown. A ternary sample group for training the image information parsing network is constructed from images of three text segments, where anchor is the base sample, positive is its positive sample and negative is its negative sample. During training, the features extracted from the anchor and positive samples are multi-dimensional matrices each comprising four sub-features along the width. Pairwise similarities are computed between the four sub-features of the anchor and those of the positive sample, giving a four-by-four matrix of similarities; the minimum-cost path through this matrix is found as illustrated, the sub-features corresponding to the similarities on the path are the sub-features containing similar parts and are paired accordingly, and the similarity between the image features of the whole text segments is computed and used as the value of a first loss function. Because the anchor and negative samples are dissimilar, a multi-dimensional matrix with the width of a single sub-feature can be constructed, and the corresponding similarity is used as the value of a second loss function. The sum of the two loss functions serves as the total loss function for iteratively training the image information parsing network.
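The following sketch assembles that total loss, reusing the min_cost_alignment sketch given earlier; the hinge margin on the negative term is an assumption of this illustration (the patent states only that the two loss values are summed):

```python
import numpy as np

def total_loss(anchor: np.ndarray, positive: np.ndarray,
               negative: np.ndarray, margin: float = 0.2) -> float:
    """Sum of two loss terms for one ternary sample group.

    First term: accumulated distance along the minimum-cost path between the
    anchor and positive sub-features (low when the aligned similar parts match).
    Second term: the anchor/negative pair is compared at the width of a single
    sub-feature (features pooled over the whole width); a hinge keeps the pair
    at least `margin` apart.
    """
    pos_cost, _ = min_cost_alignment(anchor, positive)
    neg_dist = float(np.linalg.norm(anchor.mean(axis=0) - negative.mean(axis=0)))
    return float(pos_cost) + max(0.0, margin - neg_dist)
```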
Fig. 1 also shows an example of extracting video content: the two text segments "happy new" and "happy new year" are extracted from two video frames and fed into the image information parsing network, which extracts their image features; similarity is computed from the features, the segments are determined to be associated text segments because the similarity satisfies the condition, and merging yields the content "happy new year" extracted from the video.
Referring to fig. 2, a flowchart of a text content extraction method according to the first embodiment of the present application is shown; the method specifically includes the following steps:
step 101, extracting image features of text segments in at least two images;
step 102, determining a similarity between the image features of the text segments in the at least two images by performing similarity calculation on similar parts of the text segments in the at least two images;
and step 103, merging the text segments in the at least two images according to the similarity to obtain text content corresponding to the at least two images.
According to the embodiments of the present application, after the image features of the text segments in at least two images are extracted, similarity calculation is performed on the similar parts of those text segments to determine the similarity between their image features, and the text segments are then merged according to the similarity to obtain the text content corresponding to the at least two images. Because the similarity between the image features is determined by computing similarity over the similar parts of the text segments, that is, the text segments in the at least two images are dynamically matched and aligned so that their similar parts correspond before similarity is computed, the resulting similarity characterizes the resemblance between text segments more accurately than a scheme that computes similarity on the text segments directly, and no sequence features between text segments need to be annotated. The text segments in the images can therefore be better associated, the video content extraction result is concise and complete, the requirements on output content are met, and the user experience is improved. Moreover, the scheme adapts well to the partially overlapping sequence characteristics between the text segments of video frames in video scenes, and has strong generality.
In an alternative embodiment, the extracting of image features of text segments in at least two images includes:
dividing the image region where a text segment is located into a plurality of character regions;
and extracting the image features of the character regions separately, and combining the image features of the character regions into the image feature of the text segment.
In an optional embodiment, the method further comprises:
acquiring associated image sample pairs, and training, according to the associated image sample pairs, an image information parsing network for extracting image features of text segments in images.
In an optional embodiment, the training of the image information parsing network for extracting image features of text segments in images according to the associated image sample pairs includes:
extracting the image features of the text segments in the associated image samples separately;
determining the similarity between the image features of the text segments of the associated image sample pair by performing similarity calculation on similar parts of those text segments;
and training, according to the similarity determined for the associated image sample pair, the image information parsing network for extracting image features of text segments in images.
In an alternative embodiment, the determining of the similarity between the image features of the text segments of the associated image sample pair by performing similarity calculation on similar parts of those text segments includes:
dividing the image features into a plurality of sub-features along the feature-space width, and determining the similarities between the sub-features of the text segments;
associating the sub-features containing similar parts according to the similarities between the sub-features;
and determining the similarity between the image features of the text segments according to the correspondence of the associated sub-features.
In an optional embodiment, the associating of the sub-features containing similar parts according to the similarities between the sub-features includes:
constructing a similarity matrix from the similarities between all the sub-features of the text segments;
and determining the minimum-cost path of the similarity matrix, and taking the sub-feature correspondence represented by the minimum-cost path as the correspondence of the associated sub-features.
In an optional embodiment, the determining, according to the correspondence of the associated sub-features, of the similarity between the image features of the text segments of the associated image sample pair includes:
determining the similarity between the image features of the text segments of the associated image sample pair according to the similarities of the sub-features on the minimum-cost path.
In an alternative embodiment, the obtaining of the associated image sample pairs includes:
extracting text segments from image frames of a video sample respectively;
and taking the image frames whose text segments contain corresponding similar parts as associated image sample pairs.
In an optional embodiment, the method further comprises:
taking image frames whose text segments contain no similar parts as non-associated image sample pairs;
and determining the similarity between the image features of the text segments of a non-associated image sample pair by performing similarity calculation on similar parts of those text segments;
the training of the image information parsing network according to the similarity determined for the associated image sample pairs then comprises:
training the image information parsing network according to the similarity determined for the associated image sample pairs and the similarity determined for the non-associated image sample pairs.
Image frames whose text segments contain similar parts can thus be used as associated image sample pairs, and image frames whose text segments contain no similar parts as non-associated image sample pairs, i.e., ternary sample groups containing positive and negative sample pairs can be constructed; training simultaneously with the loss function of the positive pairs and the loss function of the negative pairs yields an image information parsing network with more accurate characterization.
In an alternative embodiment, the determining the similarity between the image features of the text segments in the at least two images by performing similarity calculation on similar parts of the text segments in the at least two images includes:
dividing the image features into a plurality of sub-features along the feature-space width, and determining the similarity between the sub-features of the text segments of the at least two images;
according to the similarity between the sub-features, associating the sub-features comprising similar parts;
and determining the similarity between the image features of the text segments in the at least two images according to the correspondence of the associated sub-features.
In an optional embodiment, the associating the sub-features comprising similar parts according to the similarity between the sub-features includes:
constructing a similarity matrix according to the similarities between all the sub-features of the text segments of the at least two images;
and determining a minimum-cost path through the similarity matrix, and taking the sub-feature correspondence represented by the minimum-cost path as the correspondence of the associated sub-features.
In an optional embodiment, the determining, according to the correspondence of the associated sub-features, the similarity between the image features of the text segments in the at least two images includes:
determining the similarity between the image features of the text segments in the at least two images according to the similarities of the sub-features along the minimum-cost path.
In an optional embodiment, the merging the text segments in the at least two images according to the similarity includes:
performing de-duplication processing and/or completion processing on the text segments in at least two images whose similarity satisfies a set range.
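Merging thus has two modes: near-identical segments are de-duplicated, while partially overlapping segments complete one another. A minimal, purely illustrative sketch at the recognized-string level follows; the thresholds and the suffix/prefix splicing rule are assumptions rather than values from this application.

```python
def merge_segments(prev: str, curr: str, sim: float,
                   lo: float = 0.6, hi: float = 0.95) -> str:
    if sim >= hi:                           # near-identical: de-duplication
        return prev if len(prev) >= len(curr) else curr
    if sim >= lo:                           # partial overlap: completion
        for k in range(min(len(prev), len(curr)), 0, -1):
            if prev.endswith(curr[:k]):     # longest suffix/prefix match
                return prev + curr[k:]      # append only the new tail
        return prev + curr
    return prev + "\n" + curr               # not similar enough: keep both

print(merge_segments("the quick brown", "brown fox jumps", sim=0.8))
# -> "the quick brown fox jumps"
```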
In an optional embodiment, the method further comprises:
providing the obtained text content on a device interface;
and displaying updated text content based on an editing operation on the obtained text content.
An embodiment of training an image information analysis network is given below. Referring to fig. 3, a flowchart of a processing method of an image information analysis network according to the second embodiment of the present application is shown; the method may specifically include the following steps:
step 201, obtaining a pair of associated image samples;
step 202, respectively extracting the image features of the text segments in the associated image sample pair;
step 203, determining the similarity between the image features of the text segments in the associated image sample pair by performing similarity calculation on the similar parts of those text segments;
step 204, training an image information analysis network according to the similarity determined for the associated image sample pair, the network being used for extracting the image features of text segments in an image.
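A minimal sketch of steps 201 to 204 as one training iteration follows. It assumes a trainable backbone `encoder` that maps a text-segment crop to a width-ordered sequence of sub-features, and a differentiable (e.g. PyTorch) variant of the `aligned_similarity` routine sketched earlier; both names, and the simple `1 - similarity` loss, are illustrative assumptions.

```python
import torch

def train_step(encoder: torch.nn.Module, optimizer: torch.optim.Optimizer,
               crop_a: torch.Tensor, crop_b: torch.Tensor) -> float:
    feat_a = encoder(crop_a)                  # (W1, C) sub-features along the width
    feat_b = encoder(crop_b)                  # (W2, C)
    sim = aligned_similarity(feat_a, feat_b)  # similarity of the aligned parts only
    loss = 1.0 - sim                          # associated pair: raise aligned similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```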
Compared with existing schemes that extract hand-designed image features, the image information analysis network used for extracting the image features of the text segments is trained according to the similarity between the image features of the text segments in the associated image sample pairs. It can therefore adaptively learn image features under projection transformation, illumination occlusion, jitter blur and similar conditions, and can accommodate large changes in the form and content of text segments over the course of a video. As a result, the text segments of video frames can be associated more accurately, the extraction result of the video content is concise and complete, the requirement on the output content is met, and the use experience of users is improved.
Because the image features are extracted by an image information analysis network trained according to the similarity between the image features of the text segments in the associated image sample pairs, and because that similarity is determined by performing similarity calculation on the similar parts of the text segments, the text segments in the associated image sample pairs are in effect dynamically matched and aligned, and the similarity calculation is performed after the similar parts have been put into correspondence. Compared with a scheme that performs similarity calculation on the text segments directly, the obtained similarity represents the similarity between the text segments more accurately, the order relations between the text segments do not need to be labeled, and the scheme better fits the partially overlapping, sequential character of text segments across video frames, so it is more universal. The image features extracted by a network trained according to such similarities are more robust to the sequential characteristics of the text segments, and when used for calculating text-segment similarity, they associate the text segments of adjacent video frames better.
An embodiment of training an image information analysis network from a ternary sample group is given below. Referring to fig. 4, a flowchart of a processing method of an image information analysis network according to the third embodiment of the present application is shown; the method may specifically include the following steps:
step 301, obtaining a ternary sample group including a related image sample pair and a non-related image sample pair;
step 302, respectively extracting the image features of the text segments in the ternary sample group;
step 303, determining the similarity between the image features of the text segments in the associated image sample pair by performing similarity calculation on the similar parts of those text segments;
step 304, determining the similarity between the image features of the text segments in the non-associated image sample pair by performing similarity calculation on the similar parts of those text segments;
step 305, training an image information analysis network according to the similarity determined for the associated image sample pair and the similarity determined for the non-associated image sample pair, the network being used for extracting the image features of text segments in an image.
Compared with existing schemes that extract hand-designed image features, the image information analysis network used for extracting the image features of the text segments is trained according to the similarity between the image features of the text segments in the associated image sample pairs. It can therefore adaptively learn image features under projection transformation, illumination occlusion, jitter blur and similar conditions, and can accommodate large changes in the form and content of text segments over the course of a video. As a result, the text segments of video frames can be associated more accurately, the extraction result of the video content is concise and complete, the requirement on the output content is met, and the use experience of users is improved.
Because the image features are extracted by an image information analysis network trained according to the similarity between the image features of the text segments in the associated image sample pairs, and because that similarity is determined by performing similarity calculation on the similar parts of the text segments, the text segments in the associated image sample pairs are in effect dynamically matched and aligned, and the similarity calculation is performed after the similar parts have been put into correspondence. Compared with a scheme that performs similarity calculation on the text segments directly, the obtained similarity represents the similarity between the text segments more accurately, the order relations between the text segments do not need to be labeled, and the scheme better fits the partially overlapping, sequential character of text segments across video frames, so it is more universal. The image features extracted by a network trained according to such similarities are more robust to the sequential characteristics of the text segments, and when further used for calculating text-segment similarity, they associate the text segments of adjacent video frames better.
Image frames whose text segments have similar parts can be used as associated image sample pairs, and image frames whose text segments have no similar parts can be used as non-associated image sample pairs. In other words, a ternary sample group comprising a positive sample pair and a negative sample pair can be constructed, and the loss functions of the positive and negative sample pairs can be combined for simultaneous training, so that an image information analysis network with more accurate representations is obtained.
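A minimal sketch of the combined objective over a ternary sample group follows, assuming the aligned similarities of the positive (associated) pair and the negative (non-associated) pair have already been computed; the hinge form and the margin value are illustrative assumptions rather than details from this application.

```python
import torch
import torch.nn.functional as F

def ternary_group_loss(sim_pos: torch.Tensor, sim_neg: torch.Tensor,
                       margin: float = 0.3) -> torch.Tensor:
    # Each positive pair should beat its negative pair by at least `margin`;
    # the hinge is zero once that holds, so well-separated triplets stop training.
    return F.relu(sim_neg - sim_pos + margin).mean()
```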
The following gives an embodiment that uses the image features of the text segments for association retrieval between images: a second image similar to a first image is determined according to the image features extracted by the image information analysis network, and the second image is provided as a retrieval result. Referring to fig. 5, a flowchart of an image retrieval method according to the fourth embodiment of the present application is shown; the method specifically includes the following steps:
step 401, acquiring a first image serving as the retrieval basis;
step 402, extracting image characteristics of the character segments in the first image;
step 403, determining similarity between image features of the text segment in the first image and the text segment in the second image by performing similarity calculation on similar parts of the text segments respectively included in the first image and the second image;
step 404, determining, according to the similarity, that the second image is an associated image of the first image;
step 405, providing the associated image as an image retrieval result.
According to this embodiment of the application, after the image features of the text segments in the two images are extracted, the similarity between those image features is determined by performing similarity calculation on the similar parts of the text segments in the two images. Because the text segments in the two images are dynamically matched and aligned, and the similarity calculation is performed after the similar parts have been put into correspondence, the obtained similarity represents the similarity between the text segments more accurately than a scheme that performs similarity calculation on the text segments directly. Moreover, the scheme better fits the partially overlapping, sequential character of text segments across video frames, is more universal, and retrieves similar images better when used for calculating text-segment similarity.
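A minimal sketch of steps 401 to 405 follows, assuming a gallery of candidate second images with precomputed text-segment features and the `aligned_similarity` routine sketched earlier; the threshold, the ranking policy, and the field names are illustrative assumptions.

```python
def retrieve(query_feat, gallery, threshold: float = 0.8, top_k: int = 5):
    scored = [(aligned_similarity(query_feat, g["feat"]), g["image_id"])
              for g in gallery]                        # step 403: aligned similarity
    scored = [s for s in scored if s[0] >= threshold]  # step 404: associated images
    scored.sort(key=lambda t: t[0], reverse=True)      # most similar first
    return [image_id for _, image_id in scored[:top_k]]  # step 405: retrieval result
```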
The above scheme for associating and processing the text segments in at least two images can be applied to processing the text segments in education live streams, entertainment live streams, and remote-course, news, street-view, commodity and entertainment videos. By associating the text segments of adjacent image frames in a video, the text content corresponding to the video can be generated. For text content obtained in an education scene, the directory structure and knowledge content in the video can be quickly converted into text information, which is convenient for users to take notes and search quickly. Text segments obtained in an entertainment scene can be used as subtitle or special-effect editing material. Where subtitles or special-effect text in the video are missing, the missing part of the content can be supplemented and the text segments displayed by the video frames updated accordingly.
An embodiment applying video content extraction to a course-video scene is given below. Referring to fig. 6, a flowchart of a method for extracting course content according to the fifth embodiment of the present application is shown; the method may specifically include the following steps:
step 501, extracting image characteristics of character segments in image frames of a curriculum video;
step 502, determining the similarity between the image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on the similar parts of the text segments in the two adjacent image frames;
step 503, combining the text segments in two adjacent image frames with similarity satisfying the set range, and obtaining the text content corresponding to the curriculum video.
According to this embodiment of the application, when content is extracted from a course video, based on the relevance of the text segments displayed in the video, the image features of the text segments in the image frames of the course video are extracted, the similarity between the image features of the text segments in two adjacent image frames is determined, and the text segments are merged according to the similarity. This yields the text content corresponding to the course video, i.e. the course content in text form, which is convenient for subsequent study.
After the image features of the text segments in two adjacent image frames are extracted, the similarity between those image features is determined by performing similarity calculation on the similar parts of the text segments, i.e. the text segments in the two adjacent image frames are dynamically matched and aligned, and the similarity calculation is performed after the similar parts have been put into correspondence. Compared with a scheme that performs similarity calculation on the text segments directly, the obtained similarity represents the similarity between the text segments more accurately, and the order relations between the text segments do not need to be labeled, so the text segments can be associated better, the extraction result of the video content is concise and complete, the requirement on the output content is met, and the use experience of users is improved. Moreover, the scheme better fits the partially overlapping, sequential character of text segments across video frames and is more universal.
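A minimal end-to-end sketch of steps 501 to 503 over sampled course-video frames follows. The `detect_text_segment`, `extract_features` and `recognize` helpers are hypothetical stand-ins for a text detector, the image information analysis network and an OCR module; the sketch reuses the `aligned_similarity` and `merge_segments` routines from earlier and assumes one text segment per frame for brevity.

```python
def extract_course_content(frames, threshold: float = 0.6) -> str:
    content, prev_feat = [], None
    for frame in frames:                      # adjacent image frames of the video
        crop = detect_text_segment(frame)     # hypothetical text detector
        if crop is None:
            continue
        feat = extract_features(crop)         # hypothetical analysis-network wrapper
        text = recognize(crop)                # hypothetical OCR helper
        sim = aligned_similarity(prev_feat, feat) if prev_feat is not None else 0.0
        if content and sim >= threshold:      # similarity within the set range:
            content[-1] = merge_segments(content[-1], text, sim)  # de-dup / complete
        else:
            content.append(text)              # an unrelated, new text segment
        prev_feat = feat
    return "\n".join(content)                 # text content of the course video
```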
An embodiment applying video content extraction to video subtitle extraction is given below. Referring to fig. 7, a flowchart of a subtitle extraction method according to the sixth embodiment of the present application is shown; the method may specifically include the following steps:
step 601, respectively identifying the image features of subtitle segments from the image frames of a video;
step 602, determining the similarity between the image features of the subtitle segments respectively corresponding to two adjacent image frames by performing similarity calculation on the similar parts of the subtitle segments in the two adjacent image frames;
step 603, merging the subtitle segments in two adjacent image frames whose similarity satisfies a set range to obtain the subtitle content corresponding to the video.
In an alternative embodiment, the determining the similarity between the image features of the subtitle segments corresponding to the two adjacent image frames respectively by performing similarity calculation on similar portions of the text segments in the two adjacent image frames includes:
dividing the image features into a plurality of sub-features along the feature-space width, and determining the similarity between the sub-features of the subtitle segments of two adjacent image frames;
according to the similarity between the sub-features, associating the sub-features comprising similar parts;
and determining the similarity between the image characteristics of the character segments in the two adjacent image frames according to the corresponding relation of the associated sub-characteristics.
In an alternative embodiment, the merging the subtitle segments in two adjacent image frames with similarity satisfying the set range includes:
and performing de-duplication processing and/or completion processing on the subtitle segments in the two adjacent image frames whose similarity satisfies the set range.
After the image features of the subtitle segments in two adjacent image frames are extracted, the similarity between those image features is determined by performing similarity calculation on the similar parts of the subtitle segments, i.e. the subtitle segments in the two adjacent image frames are dynamically matched and aligned, and the similarity calculation is performed after the similar parts have been put into correspondence. Compared with a scheme that performs similarity calculation on the subtitle segments directly, the obtained similarity represents the similarity between the subtitle segments more accurately, and the order relations between the subtitle segments do not need to be labeled, so the subtitle segments can be associated better, the extraction result of the video content is concise and complete, the requirement on the output content is met, and the use experience of users is improved. In addition, the scheme better fits the partially overlapping, sequential character of subtitle segments across video frames and is more universal.
With reference to fig. 8, a flowchart of a video content extraction method according to a seventh embodiment of the present application is shown, where the method specifically includes the following steps:
step 701, extracting image characteristics of character segments in image frames of a target video;
step 702, determining similarity between image characteristics of character segments in two adjacent image frames by performing similarity calculation on similar parts of the character segments in the two adjacent image frames;
step 703, merging the text segments in the two adjacent image frames according to the similarity to obtain the text content corresponding to the target video.
After the image features of the text segments in two adjacent image frames of the video are extracted, the similarity between those image features is determined by performing similarity calculation on the similar parts of the text segments, i.e. the text segments in the two adjacent image frames are dynamically matched and aligned, and the similarity calculation is performed after the similar parts have been put into correspondence. Compared with a scheme that performs similarity calculation on the text segments directly, the obtained similarity represents the similarity between the text segments more accurately, and the order relations between the text segments do not need to be labeled, so the text segments can be associated better, the extraction result of the video content is concise and complete, the requirement on the output content is met, and the use experience of users is improved. Moreover, the scheme better fits the partially overlapping, sequential character of text segments across video frames and is more universal.
An embodiment applying video content extraction to text-segment processing in a video-conference scene is given below. Referring to fig. 9, a flowchart of a conference content processing method according to the eighth embodiment of the present application is shown; the method may specifically include the following steps:
step 801, acquiring a conference video in real time;
step 802, extracting image characteristics of text segments in image frames of a conference video;
step 803, determining similarity between image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar parts of the text segments in the two adjacent image frames;
step 804, combining the character segments in two adjacent image frames with similarity meeting the set range;
step 805, adding the text content obtained after the merging process to the conference video as a subtitle;
step 806, providing the conference video after the subtitles are added.
Courseware, documents and PPT content may be displayed in the conference video, and they contain key information related to the conference content. Text segments can therefore be extracted from the conference video and, based on the relevance between the text segments of the image frames, merged for use as subtitles.
The conference video can be obtained in real time during a video conference, and the image features of the text segments in its image frames extracted. The similarity between the image features of the text segments respectively corresponding to two adjacent image frames is determined by performing similarity calculation on the similar parts of the text segments in the two adjacent image frames. If the similarity satisfies a set condition, the text segments of the two adjacent image frames are considered associated and can be merged, for example by de-duplicating identical text content or by completing incompletely displayed parts from the associated text segments. The merged text content is added to the conference video as subtitles, and the subtitled conference video is provided to the conference clients for users to watch.
An embodiment applying video content extraction to a remote-video scene is given below. Referring to fig. 10, a flowchart of a remote video processing method according to the ninth embodiment of the present application is shown; the method may specifically include the following steps:
step 901, identifying image characteristics of text segments in image frames of the telemedicine video;
step 902, determining similarity between image characteristics of the character segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar parts of the character segments in the two adjacent image frames;
step 903, merging the text segments in two adjacent image frames whose similarity satisfies a set range;
step 904, updating the text segments in the remote medical video frame according to the text content obtained after the merging;
step 905, providing the updated telemedicine video.
Telemedicine can solve many medical problems that cannot be handled face to face. The video may involve the display of documents and of text on the surfaces of medicine or medical-instrument packaging, and this text content is closely related to the telemedicine session. Because it may be displayed incompletely or be partly missing, extracting and processing it according to the relevance of the text segments across multiple video frames is significant: the result is more concise and clearer than the display in the image frames, makes it easy to obtain key medical information in time, and can be viewed by the users participating in the telemedicine video during or after the video.
The image features of the text segments in the image frames can be extracted from the telemedicine video, and the similarity between the image features of the text segments respectively corresponding to two adjacent image frames determined by performing similarity calculation on the similar parts of the text segments in the two adjacent image frames. If the similarity satisfies a set condition, the text segments of the two adjacent image frames are considered associated and can be merged, for example by de-duplicating identical text content or by completing incompletely displayed parts from the associated text segments. The merged text content can be used to update the text segments in the video, and the updated video is provided to the telemedicine video terminals for viewing in real time or after the video.
The following embodiment provides the cloud interaction process for extracting text content from images. Referring to fig. 11, a flowchart of a text content extraction method according to the tenth embodiment of the present application is shown; the method specifically includes the following steps:
step 1001, acquiring at least two submitted images;
step 1002, extracting image characteristics of the character segments in the at least two images;
step 1003, determining the similarity between the image features of the text segments in the at least two images by performing similarity calculation on the similar parts of the text segments in the at least two images;
step 1004, combining the text segments in the at least two images according to the similarity to obtain text contents corresponding to the at least two images;
step 1005, providing the text content.
End users such as video platforms (video software service providers), subtitle/video/image editing professionals, or ordinary users can access the cloud service platform through a defined video interface and submit at least two images whose text content needs to be extracted. Using its own computing resources, the cloud service platform provides the corresponding video or image services: it extracts the image features of the text segments in the submitted images, performs similarity calculation on the similar parts of the text segments to determine the similarity between their image features, and merges the text segments in the images according to the similarity to obtain the corresponding text content. Referring to fig. 22, an interaction diagram of the terminal and the cloud service platform is shown; after the terminal submits at least two images, the cloud service platform feeds the corresponding text content back to the terminal for the end user to view or to edit as required.
The interface parameters can define the video sampling rate (for example, sampling once every 5 seconds, acquiring one image frame each time) and the resolution (images can be enlarged or reduced according to processing-efficiency and definition requirements). The cloud service platform can generate logs periodically, recording data such as the processing duration and resource occupation of each request, and can gather statistics on the video-processing situation of different video platforms or terminals, which facilitates subsequent optimization of the cloud service. The feedback content can include the extracted text content, the position coordinates of the text segments in the image, and the start or end time of the text segments; where one video frame contains several text segments, they can be displayed in sections or distinguished by inserting a spacer.
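As one illustration of these interface parameters and of the feedback content, a possible request/response shape is sketched below; every field name and value is an assumption made for illustration, not a documented interface of the cloud service platform.

```python
request = {
    "images": ["frame_0001.jpg", "frame_0002.jpg"],  # at least two submitted images
    "sample_interval_s": 5,       # video sampling rate: one frame every 5 seconds
    "resolution": "1280x720",     # enlarge/reduce to trade definition for speed
}
response = {
    "text_content": "Chapter 1 | Overview",         # merged text, spacer-delimited
    "segments": [
        {"text": "Chapter 1",
         "bbox": [40, 32, 310, 80],                  # position coordinates in the image
         "start_s": 0, "end_s": 15},                 # start / end time of the segment
    ],
    "log": {"duration_ms": 182, "cpu_load": 0.4},    # periodic service-side metrics
}
```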
Referring to fig. 12, a block diagram of a text content extracting apparatus according to an eleventh embodiment of the present application is shown, which may specifically include:
a feature extraction module 1101, configured to extract image features of text segments in at least two images;
a similarity determining module 1102, configured to determine similarity between image features of the text segments in the at least two images by performing similarity calculation on similar portions of the text segments in the at least two images;
a merging module 1103, configured to merge the text segments in the at least two images according to the similarity, so as to obtain text contents corresponding to the at least two images.
In an optional embodiment, the feature extraction module is specifically configured to divide the image area where a text segment is located into a plurality of character areas, respectively extract the image features of the character areas, and combine the image features of the character areas into the image features of the text segment.
In an optional embodiment, the apparatus further comprises:
and the network training module is used for acquiring the associated image sample pair and training an image information analysis network for extracting the image characteristics of the character segments in the image according to the associated image sample pair.
In an alternative embodiment, the network training module includes:
the feature extraction submodule is used for respectively extracting the image features of the text segments in the associated image sample pair;
the similarity determining submodule is used for determining the similarity between the image features of the text segments in the associated image sample pair by performing similarity calculation on the similar parts of those text segments;
and the similarity training submodule is used for training an image information analysis network according to the similarity determined for the associated image sample pair, the network being used for extracting the image features of text segments in an image.
In an optional embodiment, the similarity determination submodule includes:
the sub-feature dividing subunit is used for dividing the image features into a plurality of sub-features in the feature space width;
a similarity determining subunit, configured to determine a similarity between sub-features of text segments of at least two images;
the sub-feature association subunit is used for associating the sub-features comprising similar parts according to the similarity between the sub-features;
and the similarity determining subunit is used for determining the similarity between the image characteristics of the character segments in the at least two images according to the corresponding relation of the associated sub-characteristics.
In an optional embodiment, the sub-feature association subunit is specifically configured to construct a similarity matrix according to the similarities between all the sub-features of the text segments of the at least two images, determine a minimum-cost path through the similarity matrix, and take the sub-feature correspondence represented by the minimum-cost path as the correspondence of the associated sub-features.
In an optional embodiment, the merging module is specifically configured to perform de-duplication processing and/or completion processing on text segments in at least two images with similarity satisfying a set range.
In an optional embodiment, the apparatus further comprises:
the content providing module is used for providing the obtained text content on the device interface;
and the content updating module is used for displaying updated text content based on an editing operation on the obtained text content.
Referring to fig. 13, a block diagram of a processing device of an image information analysis network according to a twelfth embodiment of the present application is shown, where the block diagram specifically includes:
an image pair obtaining module 1201, configured to obtain a pair of associated image samples;
the image feature extraction module 1202 is further configured to extract image features of the Chinese character fragments of the associated image sample pairs respectively;
a similarity calculation module 1203, configured to determine similarity between image features of the Chinese character segments of the associated image samples by performing similarity calculation on similar portions of the Chinese character segments of the associated image samples;
a network training module 1204, configured to train an image information analysis network according to the determined similarity for the associated image sample pair, and configured to extract image features of the text segments in the image.
Referring to fig. 14, a block diagram of a processing device of an image information analysis network according to a thirteenth embodiment of the present application is shown, and specifically, the processing device may include:
a sample group obtaining module 1301, configured to obtain a ternary sample group including an associated image sample pair and a non-associated image sample pair;
an image feature extraction module 1302, configured to respectively extract image features of the text segments in the ternary sample group;
a similarity calculation module 1303, configured to determine the similarity between the image features of the text segments in the associated image sample pair by performing similarity calculation on the similar parts of those text segments;
a similarity calculation module 1304, configured to determine the similarity between the image features of the text segments in the non-associated image sample pair by performing similarity calculation on the similar parts of those text segments;
the network training module 1305 is configured to train an image information analysis network according to the similarity determined for the associated image sample pair and the similarity determined for the non-associated image sample pair, and is configured to extract image features of the text segments in the image.
Referring to fig. 15, a block diagram of an image retrieval apparatus according to a fourteenth embodiment of the present application is shown, which may specifically include:
a first image obtaining module 1401, configured to obtain a first image serving as the retrieval basis;
a feature extraction module 1402, configured to extract image features of the text segments in the first image;
a similarity calculation module 1403, configured to determine similarity between image features of text segments in the first image and the second image by performing similarity calculation on similar portions of text segments included in the first image and the second image, respectively;
an image association module 1404, configured to determine that the second image is an associated image of the first image according to the similarity;
a result providing module 1405, configured to provide the associated image as an image retrieval result.
Referring to fig. 16, a block diagram of a device for extracting curriculum contents according to a fifteenth embodiment of the present application is shown, which specifically includes:
the feature extraction module 1501 is configured to extract image features of text segments in image frames of the curriculum video;
a similarity calculation module 1502, configured to determine similarity between image features of text segments corresponding to two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
the merging module 1503 is configured to merge the text segments in the two adjacent image frames whose similarity satisfies the set range, so as to obtain the text content corresponding to the course video.
Referring to fig. 17, a block diagram illustrating a structure of a subtitle extracting apparatus according to a sixteenth embodiment of the present application is shown, which may specifically include:
a feature recognition module 1601, configured to recognize image features of subtitle segments from image frames of a target video, respectively;
a similarity calculation module 1602, configured to determine similarity between image features of subtitle segments corresponding to two adjacent image frames by performing similarity calculation on similar portions of text segments in the two adjacent image frames;
a merging module 1603, configured to merge the subtitle segments in two adjacent image frames with similarity meeting a set range to obtain subtitle content corresponding to the target video.
In an alternative embodiment, the similarity calculation module is specifically configured to divide the image features into a plurality of sub-features along the feature-space width and determine the similarity between the sub-features of the text segments of two adjacent image frames; associate, according to the similarity between the sub-features, the sub-features comprising similar parts; and determine the similarity between the image features of the text segments in the two adjacent image frames according to the correspondence of the associated sub-features.
In an optional embodiment, the merging module is specifically configured to perform de-duplication processing and/or completion processing on the text segments in two adjacent image frames whose similarity satisfies a set range.
Referring to fig. 18, a block diagram of a video content extraction apparatus according to a seventeenth embodiment of the present application is shown, which specifically includes:
a feature extraction module 1701 for extracting image features of text segments in an image frame of a target video;
a similarity calculation module 1702, configured to determine similarity between image features of text segments in two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
a merging module 1703, configured to merge the text segments in the two adjacent image frames according to the similarity, so as to obtain text content corresponding to the target video.
Referring to fig. 19, a block diagram illustrating the structure of a conference content processing apparatus according to the eighteenth embodiment of the present application is shown, which may specifically include:
a video conference acquisition module 1801, configured to acquire a conference video in real time;
a feature extraction module 1802, configured to extract image features of text segments in an image frame of a conference video;
a similarity calculation module 1803, configured to determine similarity between image features of text segments corresponding to two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
a merging module 1804, configured to merge text segments in two adjacent image frames whose similarity satisfies a set range;
a caption adding module 1805, configured to add, as a caption, the text content obtained after the merging processing to the conference video;
a video providing module 1806, configured to provide the meeting video after adding the subtitle.
Referring to fig. 20, a block diagram of a remote video processing apparatus according to the nineteenth embodiment of the present application is shown, which may specifically include:
a feature extraction module 1901, configured to identify image features of text segments in image frames of a telemedicine video;
a similarity calculation module 1902, configured to determine similarity between image features of text segments corresponding to two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
a merging module 1903, configured to merge text segments in two adjacent image frames whose similarity satisfies a set range;
a text updating module 1904, configured to update text segments in the remote medical video frame according to text contents obtained after merging;
a video providing module 1905 for providing updated telemedicine video.
Referring to fig. 21, a block diagram of the structure of a text content extracting apparatus according to the twentieth embodiment of the present application is shown, which may specifically include:
an image acquisition module 2001 for acquiring the submitted at least two images;
a feature extraction module 2002, configured to extract image features of the text segments in the at least two images;
a similarity calculation module 2003, configured to determine the similarity between the image features of the text segments in the at least two images by performing similarity calculation on the similar parts of the text segments in the at least two images;
a merging module 2004, configured to merge the text segments in the at least two images according to the similarity, so as to obtain text contents corresponding to the at least two images;
a content providing module 2005, configured to provide the text content.
As for the apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.
Embodiments of the disclosure may be implemented as a system using any suitable hardware, firmware, software, or any combination thereof, in a desired configuration. Fig. 22 schematically illustrates an exemplary system (or apparatus) 2100 that can be used to implement various embodiments described in this disclosure.
For one embodiment, fig. 22 illustrates an exemplary system 2100 having one or more processors 2102, a system control module (chipset) 2104 coupled to at least one of the processor(s) 2102, a system memory 2106 coupled to the system control module 2104, non-volatile memory (NVM)/storage 2108 coupled to the system control module 2104, one or more input/output devices 2110 coupled to the system control module 2104, and a network interface 2112 coupled to the system control module 2104.
The processor 2102 may include one or more single-core or multi-core processors, and the processor 2102 may include any combination of general-purpose processors or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the system 2100 is capable of operating as a browser as described in embodiments herein.
In some embodiments, the system 2100 may include one or more computer-readable media (e.g., the system memory 2106 or the NVM/storage 2108) having instructions, and one or more processors 2102 coupled with the one or more computer-readable media and configured to execute the instructions to implement modules that perform the actions described in this disclosure.
For one embodiment, the system control module 2104 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 2102 and/or any suitable device or component in communication with the system control module 2104.
System control module 2104 may include a memory controller module to provide an interface to system memory 2106. The memory controller module may be a hardware module, a software module, and/or a firmware module.
System memory 2106 may be used to load and store data and/or instructions for system 2100, for example. For one embodiment, system memory 2106 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 2106 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 2104 may include one or more input/output controllers to provide an interface to NVM/storage 2108 and input/output device(s) 2110.
For example, NVM/storage 2108 may be used to store data and/or instructions. NVM/storage 2108 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 2108 may include storage resources that are physically part of the device on which system 2100 is installed or may be accessed by the device and not necessarily part of the device. For example, the NVM/storage 2108 may be accessed over a network via the input/output device(s) 2110.
The input/output device(s) 2110 may provide an interface for the system 2100 to communicate with any other suitable device, and may include communication components, audio components, sensor components, and so forth. The network interface 2112 may provide an interface for the system 2100 to communicate over one or more networks; the system 2100 may communicate wirelessly with one or more components of a wireless network in accordance with any of one or more wireless network standards and/or protocols, for example by accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 2102 may be packaged together with logic for one or more controllers (e.g., memory controller module) of the system control module 2104. For one embodiment, at least one of the processor(s) 2102 may be packaged together with logic for one or more controller(s) of the system control module 2104 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 2102 may be integrated on the same die with logic for one or more controller(s) of the system control module 2104. For one embodiment, at least one of the processor(s) 2102 may be integrated on the same die with logic for one or more controller(s) of the system control module 2104 to form a system on a chip (SoC).
In various embodiments, system 2100 may be, but is not limited to being: a browser, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 2100 may have more or fewer components and/or different architectures. For example, in some embodiments, system 2100 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
If the display includes a touch panel, the display screen may be implemented as a touch screen display to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The present application further provides a non-volatile readable storage medium, where one or more modules (programs) are stored in the storage medium; when the one or more modules are applied to a terminal device, they may cause the terminal device to execute the instructions of the method steps in the present application.
In one example, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to the embodiments of the present application when executing the computer program.
In one example there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements a method according to one or more of the embodiments of the present application.
Although certain embodiments have been illustrated and described for purposes of description, a wide variety of alternate and/or equivalent implementations calculated to achieve the same purposes may be substituted for the embodiments shown and described without departing from the scope of the present application. This application is intended to cover any adaptations or variations of the embodiments discussed herein. Therefore, it is manifestly intended that the embodiments described herein be limited only by the claims and their equivalents.

Claims (21)

1. A method for extracting text contents is characterized by comprising the following steps:
extracting image features of text segments in at least two images;
determining the similarity between the image features of the text segments in the at least two images by performing similarity calculation on similar parts of the text segments in the at least two images;
and combining the text segments in the at least two images according to the similarity to obtain text contents corresponding to the at least two images.
2. The method of claim 1, wherein extracting image features of text segments in at least two images comprises:
dividing an image area where the text segments are located into a plurality of character areas;
and respectively extracting the image characteristics of the character areas, and combining the image characteristics of the character areas into the image characteristics of the character segments.
3. The method of claim 1, further comprising:
acquiring an associated image sample pair, and training, according to the associated image sample pair, an image information analysis network for extracting the image features of text segments in an image.
4. The method of claim 3, wherein the training, according to the associated image sample pair, of the image information analysis network for extracting image features of text segments in an image comprises:
respectively extracting the image features of the text segments in the associated image sample pair;
determining the similarity between the image features of the text segments in the associated image sample pair by performing similarity calculation on the similar parts of the text segments in the associated image sample pair;
and training an image information analysis network according to the similarity determined for the associated image sample pair, the image information analysis network being used for extracting the image features of text segments in an image.
5. The method of claim 1, wherein determining the similarity between the image features of the text segments in the at least two images by performing a similarity calculation on similar portions of the text segments in the at least two images comprises:
dividing the image features into a plurality of sub-features along the feature-space width, and determining the similarity between the sub-features of the text segments of the at least two images;
according to the similarity between the sub-features, associating the sub-features comprising similar parts;
and determining the similarity between the image features of the text segments in the at least two images according to the correspondence of the associated sub-features.
6. The method of claim 5, wherein associating sub-features comprising similar parts according to similarities between the sub-features comprises:
constructing a similarity matrix according to the similarities between all the sub-features of the text segments of the at least two images;
and determining a minimum-cost path through the similarity matrix, and taking the sub-feature correspondence represented by the minimum-cost path as the correspondence of the associated sub-features.
7. The method of claim 1, wherein the merging the text segments in the at least two images according to the similarity comprises:
and performing de-duplication processing and/or completion processing on the text segments in at least two images whose similarity satisfies a set range.
8. The method of claim 1, further comprising:
providing the obtained text content on a device interface;
and displaying updated text content based on an editing operation on the obtained text content.
9. A processing method of an image information analysis network is characterized by comprising the following steps:
acquiring an associated image sample pair;
respectively extracting the image features of the text segments in the associated image sample pair;
determining the similarity between the image features of the text segments in the associated image sample pair by performing similarity calculation on the similar parts of the text segments in the associated image sample pair;
and training an image information analysis network according to the similarity determined for the associated image sample pair, the image information analysis network being used for extracting the image features of text segments in an image.
10. A processing method for an image information analysis network is characterized by comprising the following steps:
obtaining a ternary sample group comprising associated image sample pairs and non-associated image sample pairs;
respectively extracting image characteristics of the character segments in the ternary sample group;
determining the similarity between the image features of the text segments in the associated image sample pair by performing similarity calculation on the similar parts of the text segments in the associated image sample pair;
determining the similarity between the image features of the text segments in the non-associated image sample pair by performing similarity calculation on the similar parts of the text segments in the non-associated image sample pair;
and training an image information analysis network according to the similarity determined for the associated image sample pair and the similarity determined for the non-associated image sample pair, wherein the image information analysis network is used for extracting the image features of text segments in an image.
11. An image retrieval method, comprising:
acquiring a first image as a retrieval basis;
extracting image features of the text segments in the first image;
determining the similarity between the image features of the text segments in the first image and in a second image by performing similarity calculation on similar portions of the text segments respectively comprised in the first image and the second image;
determining that the second image is an associated image of the first image according to the similarity;
and providing the associated image as an image retrieval result.
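A schematic retrieval loop in the spirit of claim 11; aligned_similarity stands in for the alignment-aware score sketched under claim 6, and the threshold playing the role of the "set range" is an arbitrary assumption:

import numpy as np

def aligned_similarity(feat_a, feat_b):
    # Placeholder for the alignment-aware score sketched for claim 6;
    # here simply cosine similarity of mean-pooled sub-features.
    a, b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_feat, candidate_feats, threshold=0.8):
    """Rank candidate images by text-segment similarity to the query
    image and keep those whose score falls within the set range."""
    scored = [(i, aligned_similarity(query_feat, f))
              for i, f in enumerate(candidate_feats)]
    hits = [(i, s) for i, s in scored if s >= threshold]
    return sorted(hits, key=lambda t: t[1], reverse=True)

rng = np.random.default_rng(0)
query = rng.normal(size=(16, 128))                     # (width, dim) feature
candidates = [rng.normal(size=(16, 128)) for _ in range(5)] + [query]
print(retrieve(query, candidates))  # the copy of the query scores 1.0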
12. A method for extracting course content, comprising:
extracting image features of the text segments in image frames of a course video;
determining the similarity between the image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
and merging the text segments in two adjacent image frames whose similarity falls within a set range, so as to obtain the text content corresponding to the course video.
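Purely illustrative, a frame-by-frame merging loop of the kind recited in claims 12, 13, and 16 to 18, collapsed here to already-recognized strings with difflib as a stand-in for feature similarity; the 0.6 threshold is an assumed "set range":

import difflib

def merge(prev, curr):
    # De-duplicate the overlap between consecutive recognitions.
    for k in range(min(len(prev), len(curr)), 0, -1):
        if prev.endswith(curr[:k]):
            return prev + curr[k:]
    return prev + curr

frames = [
    "Welcome to the course",
    "Welcome to the course",         # unchanged frame: pure duplicate
    "Welcome to the course on OCR",  # on-screen text grows
    "Next topic: alignment",         # slide change: dissimilar text
]

content = prev = frames[0]
for curr in frames[1:]:
    sim = difflib.SequenceMatcher(None, prev, curr).ratio()
    if sim >= 0.6:                   # similarity within the set range
        content = merge(content, curr)
    else:
        content += "\n" + curr       # dissimilar: start a new line
    prev = curr

print(content)
# Welcome to the course on OCR
# Next topic: alignment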
13. A subtitle extraction method, comprising:
identifying image features of subtitle segments from image frames of a target video;
determining the similarity between the image features of the subtitle segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar portions of the subtitle segments in the two adjacent image frames;
and merging the subtitle segments in two adjacent image frames whose similarity falls within a set range, so as to obtain the subtitle content corresponding to the target video.
14. The method of claim 13, wherein determining the similarity between the image features of the subtitle segments respectively corresponding to the two adjacent image frames by performing similarity calculation on similar portions of the subtitle segments in the two adjacent image frames comprises:
dividing the image features into a plurality of sub-features along the width of the feature space, and determining the similarity between the sub-features of the subtitle segments of the two adjacent image frames;
associating the sub-features comprising similar portions according to the similarity between the sub-features;
and determining the similarity between the image features of the subtitle segments in the two adjacent image frames according to the correspondence of the associated sub-features.
15. The method of claim 13, wherein merging the subtitle segments in two adjacent image frames whose similarity satisfies a set range comprises:
performing de-duplication processing and/or filling processing on the subtitle segments in the two adjacent image frames whose similarity falls within the set range.
16. A method for extracting video content, comprising:
extracting image features of the text segments in image frames of a target video;
determining the similarity between the image features of the text segments in two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
and merging the text segments in the two adjacent image frames according to the similarity, so as to obtain the text content corresponding to the target video.
17. A conference content processing method, comprising:
acquiring a conference video in real time;
extracting image features of the text segments in image frames of the conference video;
determining the similarity between the image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
merging the text segments in two adjacent image frames whose similarity falls within a set range;
adding the text content obtained after the merging to the conference video as subtitles;
and providing the conference video with the added subtitles.
18. A remote video processing method, comprising:
identifying image features of the text segments in image frames of a telemedicine video;
determining the similarity between the image features of the text segments respectively corresponding to two adjacent image frames by performing similarity calculation on similar portions of the text segments in the two adjacent image frames;
merging the text segments in two adjacent image frames whose similarity falls within a set range;
updating the text segments in the telemedicine video frames according to the text content obtained after the merging;
and providing the updated telemedicine video.
19. A text content extraction method, comprising:
acquiring at least two submitted images;
extracting image features of the text segments in the at least two images;
determining the similarity between the image features of the text segments in the at least two images by performing similarity calculation on similar portions of the text segments in the at least two images;
merging the text segments in the at least two images according to the similarity, so as to obtain the text content corresponding to the at least two images;
and providing the text content.
20. An electronic device, comprising: a processor; and
a memory having executable code stored thereon that, when executed, causes the processor to perform the method of any of claims 1-18.
21. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the method of any of claims 1-18.
CN202110212571.1A (filed 2021-02-25, priority 2021-02-25): Text content extraction method and device. Legal status: Pending. Published as CN114973219A (en).

Priority Application (1): CN202110212571.1A, priority date 2021-02-25, filing date 2021-02-25, title "Text content extraction method and device".

Publication (1): CN114973219A, publication date 2022-08-30.

Family ID: 82972612

Country Status (1): CN, CN114973219A (en).

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination