CN112699272A - Information output method and device and electronic equipment

Information output method and device and electronic equipment

Info

Publication number
CN112699272A
Authority
CN
China
Prior art keywords
word
target
video
text
search
Prior art date
Legal status
Granted
Application number
CN202110015895.6A
Other languages
Chinese (zh)
Other versions
CN112699272B (en)
Inventor
肖学锋
林丽
赵田雨
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110015895.6A
Publication of CN112699272A
Priority to PCT/CN2021/140160 (WO2022148239A1)
Application granted
Publication of CN112699272B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure disclose an information output method and apparatus and an electronic device. One embodiment of the method comprises: in response to receiving a search word for searching text in a video, acquiring a video text recognition result of the video; selecting a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; acquiring the video frames in which the target word is presented, and selecting from them a video frame meeting a preset condition as a target video frame; and generating and outputting a corrected text based on the search word and the text presented in the target video frame. This embodiment can accurately retrieve the correct result even when the text recognition result of the video contains errors.

Description

Information output method and device and electronic equipment
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to an information output method, an information output device and electronic equipment.
Background
At present, with the overall development of informatization, character recognition technology has entered a mature stage of industrial application. When recognizing characters in a video, key frames are usually extracted from the video first, and the characters in those key frames are then recognized. However, because key frame extraction can be inaccurate and detection and recognition introduce errors, the final recognition accuracy cannot be guaranteed to reach one hundred percent. Therefore, when retrieving a video text recognition result, how to accurately retrieve the correct result when the text recognition result contains errors is a problem to be solved urgently.
Disclosure of Invention
This summary is provided to introduce concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The embodiments of the disclosure provide an information output method and apparatus and an electronic device, by which, when a text recognition result of a video is retrieved, the correct result can be accurately retrieved and output even if the text recognition result contains errors.
In a first aspect, an embodiment of the present disclosure provides an information output method, the method including: in response to receiving a search word for searching text in a video, acquiring a video text recognition result of the video; selecting a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; acquiring the video frames in which the target word is presented, and selecting a video frame meeting a preset condition from the video frames as a target video frame; and generating and outputting a corrected text based on the search word and the text presented in the target video frame.
In a second aspect, an embodiment of the present disclosure provides an information output apparatus, including: an acquisition unit configured to acquire a video text recognition result of a video in response to receiving a search word for searching text in the video; a first selection unit configured to select a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; a second selection unit configured to acquire the video frames in which the target word is presented and to select a video frame meeting a preset condition from the video frames as a target video frame; and an output unit configured to generate and output a corrected text based on the search word and the text presented in the target video frame.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the information output method according to the first aspect.
In a fourth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the steps of the information output method according to the first aspect.
With the information output method and apparatus and the electronic device provided by the embodiments of the disclosure, a video text recognition result of a video is acquired in response to receiving a search word for searching text in the video; a word is then selected from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; the video frames in which the target word is presented are acquired, and a video frame meeting a preset condition is selected from them as a target video frame; finally, a corrected text is generated and output based on the search word and the text presented in the target video frame. In this way, when the text recognition result of a video is retrieved, the correct result can be accurately retrieved and output even if the recognition result contains errors.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is an exemplary system architecture diagram in which various embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of an information output method according to the present disclosure;
FIG. 3 is a flow diagram for one embodiment of selecting words from video text recognition results as target words according to the information output method of the present disclosure;
FIG. 4 is a flow diagram for one embodiment of generating corrected text according to the information output method of the present disclosure;
FIG. 5 is a diagram illustrating one embodiment of a correspondence between search terms, recognized words, and text error types according to an information output method of the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of an information output device according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that references to "a" and "an" in this disclosure are illustrative rather than limiting; those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the information output methods of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 1011, 1012, 1013, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 1011, 1012, 1013 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 1011, 1012, 1013 to interact with the server 103 via the network 102 to send or receive messages and the like. For example, the user may receive the corrected text output by the server 103 on the terminal devices 1011, 1012, 1013, and the server 103 may receive the search word input by the user through the terminal devices 1011, 1012, 1013. Various communication client applications, such as video processing applications, text recognition applications, and instant messaging software, may be installed on the terminal devices 1011, 1012, 1013.
The terminal devices 1011, 1012, 1013 may be hardware or software. When the terminal devices 1011, 1012, 1013 are hardware, they may be various electronic devices having a display screen and supporting information interaction, including but not limited to smart cameras, smart phones, tablet computers, laptop computers, and the like. When the terminal devices 1011, 1012, 1013 are software, they may be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 103 may be a server that provides various services. For example, a video text recognition result of a video can be obtained in response to receiving a search word that a user searches for characters in the video by using the terminal device 1011, 1012, 1013; then, based on the similarity between the search word and each word in the video text recognition result, selecting a word from the video text recognition result as a target word; then, the video frames presenting the target words can be obtained, and the video frames meeting preset conditions are selected from the video frames to serve as the target video frames; finally, a corrected text may be generated based on the search term and the text presented in the target video frame, and the corrected text may be sent to the terminal device 1011, 1012, 1013 for output.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the information output method provided in the embodiment of the present application is generally executed by the server 103, and the information output apparatus is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an information output method according to the present disclosure is shown. The information output method comprises the following steps:
step 201, in response to receiving a search word for searching for characters in a video, obtaining a video text recognition result of the video.
In the present embodiment, the execution subject of the information output method (e.g., the server shown in fig. 1) may determine whether a search word for searching text in a video has been received. Here, the user may retrieve information related to the search word from the video by submitting the search word; for example, the characters in the video frames in which the search word is presented may be detected.
If a search word for searching for characters in the video is received, the execution main body can acquire a video text recognition result of the video. The video text recognition result may be a text recognized from the video by a text recognition method. As an example, the video text recognition result may be obtained by inputting the video into a pre-trained text recognition model.
It should be noted that the search word is usually an English search word, and the text presented in the video is also usually English text.
And step 202, selecting words from the video text recognition results as target words based on the similarity between the search words and the words in the video text recognition results.
In this embodiment, the execution subject may select a word from the video text recognition result as the target word based on the similarity between the search word and each word in the video text recognition result. Specifically, the execution body may determine the cosine similarity between the search word and each word in the video text recognition result; the execution subject may also determine the matrix similarity between the search word and each word in the video text recognition result; or, for each word in the video text recognition result, the execution main body may input the search word and that word into a pre-trained character similarity determination model to obtain the similarity between them.
It should be noted that the cosine similarity calculation method and the matrix similarity calculation method are well-known technologies that are widely researched and applied at present, and are not described herein again.
In this embodiment, the execution subject may select a word having the greatest similarity with the search word from the video text recognition result as a target word.
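As an illustration of this selection step, the following sketch ranks the recognition-result words by cosine similarity to the search word. The character-bigram vector representation and all function names are assumptions of the sketch, not part of the disclosure, which equally permits matrix similarity or a pre-trained character similarity model.

```python
# Illustrative sketch: rank recognition-result words by cosine similarity
# to the search word using character-bigram count vectors. The bigram
# representation and function names are assumptions of this sketch.
import math
from collections import Counter

def bigrams(word):
    padded = f"#{word}#"  # pad so one-character words still yield bigrams
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine_similarity(a, b):
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def pick_target_word(search_word, recognized_words):
    return max(recognized_words, key=lambda w: cosine_similarity(search_word, w))

# pick_target_word("milk", ["drinking", "mil", "is", "good"]) -> "mil"
```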
Step 203, acquiring the video frames presenting the target words, and selecting the video frames meeting the preset conditions from the video frames as the target video frames.
In this embodiment, the execution main body may acquire the video frames in which the target word is presented. Since the target word may be presented in more than one video frame, multiple such frames may be acquired.
Then, the execution subject may select a video frame meeting a preset condition from the video frames as the target video frame. The preset condition may be that the sharpness is highest; that is, the execution subject may select the sharpest video frame from the video frames as the target video frame. Specifically, the execution subject may convert each video frame to grayscale, apply the Laplacian operator to the resulting grayscale image, compute the variance of the transformed image, and select the video frame with the largest variance as the target video frame.
If only one frame of video frame has the target word, the execution body may determine that the video frame is the target video frame.
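The sharpness criterion described above can be sketched as follows, assuming the frames are BGR images decoded with OpenCV; the function name is illustrative.

```python
# Minimal sketch of the sharpness criterion, assuming OpenCV-decoded BGR
# frames (numpy arrays); sharpest_frame is an assumed name.
import cv2

def sharpest_frame(frames):
    def laplacian_variance(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # graying processing
        return cv2.Laplacian(gray, cv2.CV_64F).var()    # variance after Laplacian
    return max(frames, key=laplacian_variance)
```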
And step 204, generating a corrected text for outputting based on the search word and the text presented in the target video frame.
In this embodiment, the execution subject may generate and output a corrected text based on the search term and the text presented in the target video frame. The execution main body may replace the target word in the text presented in the target video frame with the search word, and output the corrected text.
The executing body may send the generated corrected text to the terminal device from which the search term originates, and the terminal device may present the received corrected text.
As an example, if the search word is "milk", the text presented in the target video frame is "drinking mil is good for our health", and the target word in that text is "mil", the corrected text generated may be "drinking milk is good for our health".
With the method provided by the above embodiment of the present disclosure, when the text recognition result of a video is retrieved, the correct result can be accurately retrieved and output even if the recognition result contains errors.
In some optional implementations, the executing body may select a video frame meeting a preset condition from the video frames as the target video frame as follows: the execution subject may select the video frame with the highest confidence from the video frames as the target video frame. The confidence of a video frame may be determined as follows: each character in the video frame is recognized by a confidence recognition model to obtain a per-character confidence, and the mean of these character confidences is taken as the confidence of the word. For example, for the word "word", the confidence recognition model outputs a confidence for each character: 0.99 for "w", 0.89 for "o", 0.95 for "r", and 0.95 for "d"; averaging the four confidences gives the confidence of "word" (0.945). The confidence of every other word in the video frame is obtained in the same way, and the confidences of all words are then averaged once more to obtain the confidence of the video frame.
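As a sketch of this confidence criterion, the nested averaging might look as follows; `char_confidences` is a hypothetical stand-in for the confidence recognition model.

```python
# Hedged sketch of the confidence criterion: per-character confidences are
# averaged into word confidences, which are averaged into a frame
# confidence. char_confidences stands in for the confidence recognition
# model mentioned above and is an assumption of this sketch.
from statistics import mean

def frame_confidence(words, char_confidences):
    # char_confidences("word") -> e.g. [0.99, 0.89, 0.95, 0.95] -> 0.945
    return mean(mean(char_confidences(w)) for w in words)

def most_confident_frame(frames_with_words, char_confidences):
    # frames_with_words: list of (frame, words recognized in that frame)
    return max(frames_with_words,
               key=lambda fw: frame_confidence(fw[1], char_confidences))[0]
```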
With further reference to FIG. 3, a flow 300 of one embodiment of an information output method for selecting a word from video text recognition results as a target word is illustrated. The process 300 for selecting words from video text recognition results includes the following steps:
step 301, determining an edit distance between the search word and each word in the video text recognition result as an initial distance.
In this embodiment, an execution subject of the information output method (e.g., a server shown in fig. 1) may determine an edit Distance (Levenshtein Distance) between the search word and each word in the video text recognition result as an initial Distance.
Let A and B be two character strings; the minimum number of character operations required to convert string A into string B is called the edit distance from string A to string B. The character operations include: deleting a character, inserting a character, and rewriting a character as another character. As an example, when string A is abc and string B is abf, converting string A into string B only requires rewriting the character c as the character f, so the edit distance from string A to string B is 1.
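For concreteness, one standard dynamic-programming implementation of this edit distance is sketched below.

```python
# A standard dynamic-programming Levenshtein implementation matching the
# definition above: minimum deletions, insertions, and rewrites to turn
# string a into string b.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # rewrite ca as cb (free if equal)
            ))
        prev = curr
    return prev[-1]

# edit_distance("abc", "abf") == 1, as in the example above.
```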
Step 302, determining a target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result.
In this embodiment, the execution body may determine the target distance between the search word and each word in the video text recognition result based on the initial distance determined in step 301, the character length of the search word, and the character length of each word in the video text recognition result.
Here, for each word in the video text recognition result, the execution main body may compare the character length of the search word with the character length of the word. If the character length of the search word is less than or equal to the character length of the word, the initial distance between the search word and the word may be determined as the target distance between them.
Step 303, adding words in the video text recognition result, the target distance between which and the search word is smaller than a preset target distance threshold value, into the candidate word set.
In this embodiment, the execution subject may add a word in the video text recognition result, in which a target distance between the word and the search word is smaller than a preset target distance threshold (e.g., 2), to the candidate word set.
Step 304, selecting a target word from the candidate word set based on the target distance.
In this embodiment, the execution subject may select a target word from the candidate word set based on a target distance between the search word and each word in the video text recognition result. Here, the execution main body may select a word having a smallest target distance from the search word from the candidate word set as a target word.
In the method provided by the above embodiment of the disclosure, the edit distance between the search word and each word in the video text recognition result is corrected according to the character length of the search word and the character length of each word in the video text recognition result to obtain the target distance, and the target word is then selected from the candidate word set based on the target distance, so that recognition errors that drop characters from a word do not prevent the correct word from being matched.
In some alternative implementations, the executing body may determine the target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result as follows: for each word in the video text recognition result, the execution body may determine whether the character length of the search word is greater than the character length of the word. If it is, the difference between the character length of the search word and the character length of the word may be determined, and the initial distance corresponding to the search word minus that difference may be taken as the target distance between the search word and the word. For example, if the character length of the search word is Lq, the character length of the word is Ld, and the edit distance between them is d0, then the target distance is d1 = d0 - (Lq - Ld).
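This length correction can be sketched as follows, reusing the `edit_distance` helper from the earlier sketch; the function name is an assumption.

```python
# Sketch of the length correction above, reusing edit_distance from the
# dynamic-programming sketch earlier; target_distance is an assumed name.
def target_distance(search_word: str, word: str) -> int:
    d0 = edit_distance(search_word, word)  # initial distance
    lq, ld = len(search_word), len(word)
    # d1 = d0 - (Lq - Ld) when the search word is longer, so that a word
    # that merely lost characters is not over-penalized
    return d0 - (lq - ld) if lq > ld else d0

# target_distance("milk", "mil") == 0: the shared characters match and
# the missing tail character is not counted against "mil".
```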
In some alternative implementations, the executing entity may select the target word from the candidate word set based on the target distance by: the execution subject may select a word having a smallest target distance from the search word from the candidate word set as a first candidate word; then, a target word may be selected from the first word candidates based on the character length of the first word candidate.
In some alternative implementations, the execution subject may select the target word from the first candidate word based on the character length of the first candidate word by: the execution subject may determine whether there is a candidate word having a character length identical to that of the search word in the first candidate word; if a candidate word having the same character length as the search word exists in the first candidate words, the execution main body may select a candidate word having the same character length as the search word from the first candidate words as a target word.
In some alternative implementations, the execution subject may select the target word from the first candidate word based on the character length of the first candidate word by: the execution subject may determine whether there is a candidate word having a character length identical to that of the search word in the first candidate word; if there is no candidate word having the same character length as the search word among the first candidate words, the execution main body may select a candidate word having the longest character length from among the first candidate words as a target word.
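Taken together, the candidate-set construction of step 303 and the tie-breaking rules of these implementations can be sketched as follows, again reusing `target_distance` from the previous sketch; the threshold value echoes the example given in step 303.

```python
# Combined sketch of steps 303-304 with the tie-breaking rules above,
# reusing target_distance from the previous sketch; names are assumed.
def select_target_word(search_word, recognized_words, threshold=2):
    # step 303: keep words whose target distance is below the threshold
    candidates = [w for w in recognized_words
                  if target_distance(search_word, w) < threshold]
    if not candidates:
        return None
    # step 304: narrow to the words at the minimum target distance
    best = min(target_distance(search_word, w) for w in candidates)
    first_candidates = [w for w in candidates
                        if target_distance(search_word, w) == best]
    # prefer a candidate whose length equals the search word's,
    # otherwise fall back to the longest candidate
    same_length = [w for w in first_candidates if len(w) == len(search_word)]
    return same_length[0] if same_length else max(first_candidates, key=len)
```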
With continued reference to FIG. 4, a flow 400 of one embodiment of an information output method to generate corrected text is shown. The process 400 for generating corrected text includes the following steps:
Step 401, comparing the character length of the target word in the text presented in the target video frame with the character length of the search word.
In the present embodiment, an execution subject of the information output method (e.g., a server shown in fig. 1) may compare the character length of the above-described target word in the text presented in the target video frame with the character length of the search word.
As an example, if the search word is "class" and the above-mentioned target word in the text presented in the target video frame is "cas", it may be determined that the character length of the target word "cas" is smaller than the character length of the search word "class".
Here, the target video frame may be a video frame having the highest sharpness among video frames in which the target word is present, or may be a video frame having the highest confidence among video frames in which the target word is present.
If the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, the executing main body may execute step 402.
Step 402, if the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, removing the target character string from the search word to obtain the remaining character string.
In this embodiment, if it is determined in step 401 that the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, the execution main body may remove the target character string from the search word to obtain a remaining character string. The target string may be the target word in text presented in the target video frame.
As an example, if the search word is "Armstrong" and the target word is "strong", the character string "strong" may be removed from the search word "Armstrong", leaving the remaining character string "Arm".
Step 403, determining the edit distance between the remaining character string and each word other than the target word in the text presented in the target video frame.
In this embodiment, the execution subject may determine an edit distance between the remaining character string and a word other than the target word in the text presented in the target video frame.
As an example, the text presented in the target video frame is "Arm strong is a name", the remaining character string is "Arm", and the execution subject may determine the edit distance between the remaining character string "Arm" and "Arm", "is", "a", "name", respectively.
Step 404, if the position in the target video frame of the other word with the minimum edit distance to the remaining character string is adjacent to the position of the target word in the target video frame, deleting that word from the text presented in the target video frame, replacing the target word in the deleted text with the search word, and outputting the corrected text.
In this embodiment, the execution subject may determine whether the position in the target video frame of the other word having the smallest edit distance to the remaining character string is adjacent to the position of the target word in the target video frame. If the two positions are adjacent, the execution main body may delete, from the text presented in the target video frame, the other word having the smallest edit distance to the remaining character string.
For example, if the execution subject determines that the edit distance between the remaining character string "Arm" and "Arm" in the text "Arm strong is a name" presented in the target video frame is the smallest, and the position of "Arm" in that text is adjacent to the position of the target word "strong", the execution subject may delete "Arm" from "Arm strong is a name" to obtain the deleted text "strong is a name".
And then, the execution main body can replace the target word in the deleted text by using the search word to obtain and output the corrected text.
As an example, if the search word is "Armstrong" and the deleted text is "strong is a name", the execution subject may replace the target word "strong" in the deleted text with the search word "Armstrong" to obtain the corrected text "Armstrong is a name".
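The correction path of flow 400 can be sketched end to end as follows; this reuses `edit_distance` from the earlier sketch and is an illustration of steps 401-404 under stated assumptions, not the disclosure's exact procedure.

```python
# Hedged sketch of the flow-400 repair for a target word shorter than the
# search word (the "broken in the middle" case). Reuses edit_distance from
# the earlier sketch; correct_broken_word is an assumed name.
def correct_broken_word(search_word, target_word, frame_words):
    # step 402: strip the target word from the search word
    remaining = search_word.replace(target_word, "", 1)  # "Armstrong" -> "Arm"
    idx = frame_words.index(target_word)                 # first occurrence
    # step 403: edit distance from the remaining string to every other word
    others = [(i, w) for i, w in enumerate(frame_words) if i != idx]
    best_i, _ = min(others, key=lambda iw: edit_distance(remaining, iw[1]))
    # step 404: delete the closest word only if it is adjacent, then substitute
    if abs(best_i - idx) == 1:
        corrected = [w for i, w in enumerate(frame_words) if i != best_i]
        corrected[corrected.index(target_word)] = search_word
        return " ".join(corrected)
    return " ".join(frame_words)  # non-adjacent case is not covered above

# correct_broken_word("Armstrong", "strong", ["Arm", "strong", "is", "a", "name"])
# -> "Armstrong is a name"
```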
The method provided by the above embodiment of the present disclosure provides a text correction manner when the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, so that the corrected text can be generated more accurately.
As shown in FIG. 5, FIG. 5 illustrates a diagram of one embodiment of a correspondence between search words, recognized words, and text error types. In fig. 5, if the search word is "milk" and the recognized word is "mil", the text error type is tail missing. If the search word is "300" and the recognized word is "000", the text error type is head missing. If the search word is "Armstrong" and the recognized word is "Arm strong", the text error type is broken in the middle. If the search word is "class" and the recognized word is "cas", the text error type is middle missing. If the search word is "state" and the recognized word is "stop", the text error type is a character recognition error. If the search word is "wer" and the recognized word is "werthe", the text error type is words stuck together (a space not split). If the search word is "up" and the recognized word is "upl", the text error type is extra characters being recognized.
In some alternative implementations, the executing entity may generate and output the corrected text based on the search term and the text presented in the target video frame by: the execution subject may compare a character length of the target word in a text presented in the target video frame with a character length of the search word; if the character length of the target word in the text presented in the target video frame is greater than or equal to the character length of the search word, the execution main body may replace the target word in the text presented in the target video frame with the search word, so as to obtain a corrected text, and output the corrected text.
As an example, if the search word is "up", the text presented in the target video frame is "He stood upl and went to the window", and the target word in that text is "upl", the executing body may replace "upl" with the search word "up" to obtain the corrected text "He stood up and went to the window".
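This simpler branch reduces to a direct substitution, as in the following sketch; the function name is assumed.

```python
# Minimal sketch of the simpler branch: when the target word is at least
# as long as the search word, substitute it directly in the frame text.
def correct_by_replacement(search_word, target_word, frame_text):
    return frame_text.replace(target_word, search_word)

# correct_by_replacement("up", "upl", "He stood upl and went to the window")
# -> "He stood up and went to the window"
```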
With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an information output apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the information output apparatus 600 of the present embodiment includes: an acquisition unit 601, a first selection unit 602, a second selection unit 603, and an output unit 604. The acquiring unit 601 is configured to acquire a video text recognition result of a video in response to receiving a search word for searching for a text in the video; a first selecting unit 602, configured to select a word from the video text recognition result as a target word based on similarity between the search word and each word in the video text recognition result; the second selecting unit 603 is configured to obtain a video frame in which the target word is present, and select a video frame meeting a preset condition from the video frames as a target video frame; the output unit 604 is configured to generate and output a corrected text based on the search term and the text presented in the target video frame.
In this embodiment, specific processing of the acquisition unit 601, the first selection unit 602, the second selection unit 603, and the output unit 604 of the information output apparatus 600 may refer to step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2.
In some alternative implementations, the first selecting unit 602 may be further configured to select a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result by: the first selecting unit 602 may determine an editing distance between the search word and each word in the video text recognition result as an initial distance; then, a target distance between the search word and each word in the video text recognition result can be determined based on the initial distance, the character length of the search word and the character length of each word in the video text recognition result; then, words in the video text recognition result, the target distance between which and the search word is smaller than a preset target distance threshold value, may be added to the candidate word set; finally, a target word may be selected from the set of candidate words based on the target distance.
In some alternative implementations, the first selecting unit 602 may be further configured to determine the target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result by: for each word in the video text recognition result, the first selecting unit 602 may determine whether the character length of the search word is greater than the character length of the word, and if so, may determine a difference between the character length of the search word and the character length of the word, and may determine a difference between an initial distance corresponding to the search word and the difference as a target distance between the search word and the word.
In some alternative implementations, the first selecting unit 602 may be further configured to select a target word from the candidate word set based on the target distance by: the first selecting unit 602 may select a word having a smallest target distance from the search word from the candidate word set as a first candidate word; then, a target word may be selected from the first word candidates based on the character length of the first word candidate.
In some alternative implementations, the first selecting unit 602 may be further configured to select a target word from the first candidate words based on the character length of the first candidate words by: the first selecting unit 602 may determine whether a candidate word having a character length equal to that of the search word exists in the first candidate word; if the candidate word exists, the candidate word with the same character length as the search word can be selected from the first candidate words to be used as the target word.
In some alternative implementations, the first selecting unit 602 may be further configured to select a target word from the first candidate words based on the character length of the first candidate words by: the first selecting unit 602 may determine whether a candidate word having a character length equal to that of the search word exists in the first candidate word; if not, the candidate word with the longest character length can be selected from the first candidate words as the target word.
In some alternative implementations, the output unit 604 may be further configured to generate and output a corrected text based on the search term and the text presented in the target video frame by: the output unit 604 may compare the character length of the target word in the text presented in the target video frame with the character length of the search word; if the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, the output unit 604 may remove a target character string from the search word to obtain a remaining character string, where the target character string may be the target word in the text presented in the target video frame; then, the editing distance between the residual character string and other words except the target word in the text presented in the target video frame can be determined; if the position of the word having the smallest edit distance to the remaining character string in the target video frame is adjacent to the position of the target word in the target video frame, output section 604 may delete the word having the smallest edit distance to the remaining character string in the text presented in the target video frame, replace the target word in the deleted text with the search word, and output the corrected text.
In some alternative implementations, the output unit 604 may be further configured to generate and output a corrected text based on the search term and the text presented in the target video frame by: the output unit 604 may compare the character length of the target word in the text presented in the target video frame with the character length of the search word; if the character length of the target word in the text presented in the target video frame is greater than or equal to the character length of the search word, the target word in the text presented in the target video frame can be replaced by the search word, and the corrected text is output.
In some optional implementations, the second selecting unit 603 may be further configured to select, as the target video frame, a video frame meeting a preset condition from the video frames as follows: the second selecting unit 603 may select a video frame with the highest confidence level from the video frames as a target video frame.
Referring now to FIG. 7, a block diagram of an electronic device (e.g., the server of FIG. 1) 700 suitable for use in implementing embodiments of the present disclosure is shown. The server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from the storage means 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing means 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the server; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: responding to the received search words searched aiming at the characters in the video, and acquiring a video text recognition result of the video; selecting words from the video text recognition results as target words based on the similarity between the search words and each word in the video text recognition results; acquiring video frames presenting target words, and selecting video frames meeting preset conditions from the video frames as target video frames; and generating and outputting corrected text based on the search words and the text presented in the target video frame.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a first selection unit, a second selection unit, and an output unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquiring unit may also be described as a "unit that acquires a video text recognition result of a video in response to receiving a search word for searching for a text in the video".
The foregoing description is only a description of the preferred embodiments of the disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (12)

1. An information output method, comprising:
responding to a received search word for searching characters in a video, and acquiring a video text recognition result of the video;
selecting words from the video text recognition results as target words based on the similarity of the search words and the words in the video text recognition results;
acquiring video frames presenting the target words, and selecting video frames meeting preset conditions from the video frames as target video frames;
and generating and outputting corrected text based on the search word and the text presented in the target video frame.
2. The method of claim 1, wherein the selecting words from the video text recognition results as target words based on the similarity of the search word to each word in the video text recognition results comprises:
determining an editing distance between the search word and each word in the video text recognition result as an initial distance;
determining a target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result;
adding words in the video text recognition result, wherein the target distance between the words and the search word is smaller than a preset target distance threshold value, into a candidate word set;
and selecting a target word from the candidate word set based on the target distance.
3. The method of claim 2, wherein determining the target distance between the search term and each word in the video text recognition result based on the initial distance, the character length of the search term, and the character length of each word in the video text recognition result comprises:
and determining whether the character length of the search word is greater than the character length of the word or not for each word in the video text recognition result, if so, determining the difference value between the character length of the search word and the character length of the word, and determining the difference value between the initial distance corresponding to the search word and the difference value as the target distance between the search word and the word.
4. The method of claim 2, wherein the selecting a target word from the set of candidate words based on the target distance comprises:
selecting a word with the minimum target distance to the search word from the candidate word set as a first candidate word;
and selecting a target word from the first candidate words based on the character length of the first candidate words.
5. The method of claim 4, wherein selecting a target word from the first word candidate based on the character length of the first word candidate comprises:
determining whether a candidate word having the same character length as the search word exists in the first candidate words;
and if so, selecting a candidate word with the same character length as the search word from the first candidate word as a target word.
6. The method of claim 4, wherein selecting a target word from the first word candidate based on the character length of the first word candidate comprises:
determining whether a candidate word having the same character length as the search word exists in the first candidate words;
and if the candidate word does not exist, selecting the candidate word with the longest character length from the first candidate words as the target word.
7. The method of claim 1, wherein the generating and outputting a corrected text based on the search word and the text presented in the target video frame comprises:
comparing the character length of the target word in the text presented in the target video frame with the character length of the search word;
if the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, removing a target character string from the search word to obtain a remaining character string, wherein the target character string is the target word as presented in the target video frame;
determining an edit distance between the remaining character string and each word other than the target word in the text presented in the target video frame;
and if the other word having the minimum edit distance to the remaining character string is adjacent to the target word in the target video frame, deleting that word from the text presented in the target video frame, replacing the target word in the resulting text with the search word, and outputting the corrected text.
8. The method of claim 1, wherein the generating and outputting a corrected text based on the search word and the text presented in the target video frame comprises:
comparing the character length of the target word in the text presented in the target video frame with the character length of the search word;
and if the character length of the target word in the text presented in the target video frame is greater than or equal to the character length of the search word, replacing the target word in the text presented in the target video frame with the search word to obtain the corrected text, and outputting the corrected text.
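A combined sketch of claims 7 and 8, assuming the frame text is an ordered list of words, that "adjacent" means neighboring list positions, and that plain replacement is the fallback when the adjacency check fails (the claim leaves that case open). Joining with spaces is a simplification; for Chinese text one would concatenate directly.

```python
def generate_corrected_text(search_word: str, target: str,
                            frame_words: list[str]) -> str:
    """Claims 7-8: splice the user's search word back into the frame text."""
    words = list(frame_words)
    idx = words.index(target)
    if len(target) >= len(search_word):
        words[idx] = search_word                       # claim 8: direct replacement
        return " ".join(words)
    # Claim 7: the OCR likely split the search word across two words.
    remainder = search_word.replace(target, "", 1)     # strip the target string once
    others = [(i, w) for i, w in enumerate(words) if i != idx]
    if others:
        i_min, _ = min(others, key=lambda iw: edit_distance(remainder, iw[1]))
        if abs(i_min - idx) == 1:                      # the fragment sits next to the target word
            del words[i_min]
            idx = words.index(target)                  # re-locate after deletion
    words[idx] = search_word
    return " ".join(words)
```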
9. The method according to any one of claims 1 to 8, wherein the selecting a video frame meeting a preset condition from the video frames as the target video frame comprises:
and selecting, from the video frames, the video frame with the maximum confidence level as the target video frame.
10. An information output apparatus, comprising:
an acquisition unit configured to acquire, in response to receiving a search word for searching for text in a video, a video text recognition result of the video;
a first selection unit configured to select a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result;
a second selection unit configured to acquire video frames presenting the target word and to select a video frame meeting a preset condition from the video frames as a target video frame;
and an output unit configured to generate and output a corrected text based on the search word and the text presented in the target video frame.
11. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-9.
CN202110015895.6A 2021-01-06 2021-01-06 Information output method and device and electronic equipment Active CN112699272B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110015895.6A CN112699272B (en) 2021-01-06 2021-01-06 Information output method and device and electronic equipment
PCT/CN2021/140160 WO2022148239A1 (en) 2021-01-06 2021-12-21 Method and apparatus for information output, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110015895.6A CN112699272B (en) 2021-01-06 2021-01-06 Information output method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112699272A true CN112699272A (en) 2021-04-23
CN112699272B CN112699272B (en) 2024-01-30

Family

ID=75514958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110015895.6A Active CN112699272B (en) 2021-01-06 2021-01-06 Information output method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN112699272B (en)
WO (1) WO2022148239A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022148239A1 (en) * 2021-01-06 2022-07-14 北京有竹居网络技术有限公司 Method and apparatus for information output, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711412A (en) * 2018-12-27 2019-05-03 信雅达系统工程股份有限公司 A kind of optical character identification error correction method based on dictionary
US10402490B1 (en) * 2015-08-14 2019-09-03 Shutterstock, Inc. Edit distance based spellcheck
CN112052352A (en) * 2020-09-07 2020-12-08 北京达佳互联信息技术有限公司 Video sequencing method, device, server and storage medium
CN112115299A (en) * 2020-09-17 2020-12-22 北京百度网讯科技有限公司 Video searching method and device, recommendation method, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5257071B2 (en) * 2006-08-03 2013-08-07 日本電気株式会社 Similarity calculation device and information retrieval device
US9740767B2 (en) * 2013-03-15 2017-08-22 Mapquest, Inc. Systems and methods for analyzing failed and successful search queries
CN107291904A (en) * 2017-06-23 2017-10-24 百度在线网络技术(北京)有限公司 A kind of video searching method and device
CN112699272B (en) * 2021-01-06 2024-01-30 北京有竹居网络技术有限公司 Information output method and device and electronic equipment


Also Published As

Publication number Publication date
WO2022148239A1 (en) 2022-07-14
CN112699272B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN109241286B (en) Method and device for generating text
CN106022349B (en) Method and system for device type determination
CN109858045B (en) Machine translation method and device
CN114861889B (en) Deep learning model training method, target object detection method and device
CN111078825A (en) Structured processing method, structured processing device, computer equipment and medium
CN112712795B (en) Labeling data determining method, labeling data determining device, labeling data determining medium and electronic equipment
CN111582477A (en) Training method and device of neural network model
CN112988753B (en) Data searching method and device
CN114241471B (en) Video text recognition method and device, electronic equipment and readable storage medium
CN112699272B (en) Information output method and device and electronic equipment
KR102382421B1 (en) Method and apparatus for outputting analysis abnormality information in spoken language understanding
CN113553309A (en) Log template determination method and device, electronic equipment and storage medium
CN113761845A (en) Text generation method and device, storage medium and electronic equipment
CN116028868B (en) Equipment fault classification method and device, electronic equipment and readable storage medium
US10810497B2 (en) Supporting generation of a response to an inquiry
US10257055B2 (en) Search for a ticket relevant to a current ticket
CN111626054A (en) New illegal behavior descriptor identification method and device, electronic equipment and storage medium
CN112784596A (en) Method and device for identifying sensitive words
US11157829B2 (en) Method to leverage similarity and hierarchy of documents in NN training
CN111339776B (en) Resume parsing method and device, electronic equipment and computer-readable storage medium
CN112966752B (en) Image matching method and device
US11099977B1 (en) Method, device and computer-readable storage medium for testing bios using menu map obtained by performing image identification on bios interface
CN111597224A (en) Method and device for generating structured information, electronic equipment and storage medium
CN111753548A (en) Information acquisition method and device, computer storage medium and electronic equipment
CN111460971A (en) Video concept detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant