CN112699272A - Information output method and device and electronic equipment

Information output method and device and electronic equipment

Info

Publication number
CN112699272A
Authority
CN
China
Prior art keywords
word
target
video
text
search
Prior art date
Legal status
Granted
Application number
CN202110015895.6A
Other languages
Chinese (zh)
Other versions
CN112699272B (en)
Inventor
肖学锋
林丽
赵田雨
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110015895.6A
Publication of CN112699272A
Priority to PCT/CN2021/140160 (WO2022148239A1)
Application granted
Publication of CN112699272B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure disclose an information output method and apparatus and an electronic device. One embodiment of the method comprises: in response to receiving a search word for searching text in a video, acquiring a video text recognition result of the video; selecting a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; acquiring the video frames in which the target word is presented, and selecting from them a video frame meeting a preset condition as a target video frame; and generating and outputting a corrected text based on the search word and the text presented in the target video frame. This embodiment can accurately retrieve the correct result even when the text recognition result of the video contains errors.

Description

Information output method and device and electronic equipment
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to an information output method, an information output device and electronic equipment.
Background
At present, with the overall development of informatization, character recognition technology has entered a mature stage of industrial application. When recognizing characters in a video, key frames are usually extracted from the video first, and the characters in those key frames are then recognized. However, because key frame extraction can be inaccurate and detection and recognition introduce errors, the final recognition accuracy cannot be guaranteed to reach one hundred percent. Therefore, when retrieving a video text recognition result, how to accurately retrieve the correct result when the text recognition result contains errors is a problem to be solved urgently.
Disclosure of Invention
This summary is provided to introduce concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The embodiments of the disclosure provide an information output method and apparatus and an electronic device, by which, when a text recognition result of a video is retrieved, the correct result can be accurately retrieved and output even if the text recognition result contains errors.
In a first aspect, an embodiment of the present disclosure provides an information output method, the method including: in response to receiving a search word for searching text in a video, acquiring a video text recognition result of the video; selecting a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; acquiring the video frames in which the target word is presented, and selecting a video frame meeting a preset condition from the video frames as a target video frame; and generating and outputting a corrected text based on the search word and the text presented in the target video frame.
In a second aspect, an embodiment of the present disclosure provides an information output apparatus, including: an acquisition unit configured to acquire a video text recognition result of a video in response to receiving a search word for searching text in the video; a first selection unit configured to select a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; a second selection unit configured to acquire the video frames in which the target word is presented and to select a video frame meeting a preset condition from the video frames as a target video frame; and an output unit configured to generate and output a corrected text based on the search word and the text presented in the target video frame.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the information output method according to the first aspect.
In a fourth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the steps of the information output method according to the first aspect.
With the information output method and apparatus and the electronic device provided by the embodiments of the disclosure, a video text recognition result of a video is acquired in response to receiving a search word for searching text in the video; a word is then selected from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result; the video frames in which the target word is presented are acquired, and a video frame meeting a preset condition is selected from them as a target video frame; finally, a corrected text is generated and output based on the search word and the text presented in the target video frame. In this way, when the text recognition result of a video is retrieved, the correct result can be accurately retrieved and output even if the recognition result contains errors.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is an exemplary system architecture diagram in which various embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of an information output method according to the present disclosure;
FIG. 3 is a flow diagram for one embodiment of selecting words from video text recognition results as target words according to the information output method of the present disclosure;
FIG. 4 is a flow diagram for one embodiment of generating corrected text according to the information output method of the present disclosure;
FIG. 5 is a diagram illustrating one embodiment of a correspondence between search terms, recognized words, and text error types according to an information output method of the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of an information output device according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that references to "a" and "an" in this disclosure are illustrative rather than limiting; those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the information output methods of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 1011, 1012, 1013, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 1011, 1012, 1013 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 1011, 1012, 1013 to interact with the server 103 via the network 102 to send or receive messages and the like. For example, the user may receive the corrected text output by the server 103 on the terminal devices 1011, 1012, 1013, and the server 103 may receive the search word input by the user through the terminal devices 1011, 1012, 1013. Various communication client applications, such as video processing applications, text recognition applications, and instant messaging software, may be installed on the terminal devices 1011, 1012, 1013.
The terminal devices 1011, 1012, 1013 may be hardware or software. When the terminal devices 1011, 1012, 1013 are hardware, they may be various electronic devices having a display screen and supporting information interaction, including but not limited to smart cameras, smart phones, tablet computers, laptop computers, and the like. When the terminal devices 1011, 1012, 1013 are software, they may be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 103 may be a server that provides various services. For example, a video text recognition result of a video can be obtained in response to receiving a search word that a user searches for characters in the video by using the terminal device 1011, 1012, 1013; then, based on the similarity between the search word and each word in the video text recognition result, selecting a word from the video text recognition result as a target word; then, the video frames presenting the target words can be obtained, and the video frames meeting preset conditions are selected from the video frames to serve as the target video frames; finally, a corrected text may be generated based on the search term and the text presented in the target video frame, and the corrected text may be sent to the terminal device 1011, 1012, 1013 for output.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the information output method provided in the embodiment of the present application is generally executed by the server 103, and the information output apparatus is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an information output method according to the present disclosure is shown. The information output method comprises the following steps:
step 201, in response to receiving a search word for searching for characters in a video, obtaining a video text recognition result of the video.
In the present embodiment, the execution subject of the information output method (e.g., the server shown in fig. 1) may determine whether a search word for searching text in a video has been received. Here, the user may retrieve information related to the search word from the video by submitting the search word; for example, the characters in the video frames in which the search word is presented may be detected.
If a search word for searching for characters in the video is received, the execution main body can acquire a video text recognition result of the video. The video text recognition result may be a text recognized from the video by a text recognition method. As an example, the video text recognition result may be obtained by inputting the video into a pre-trained text recognition model.
It should be noted that the search word is usually an English search word, and the text presented in the video is also usually English text.
And step 202, selecting words from the video text recognition results as target words based on the similarity between the search words and the words in the video text recognition results.
In this embodiment, the execution subject may select a word from the video text recognition result as the target word based on the similarity between the search word and each word in the video text recognition result. Specifically, the execution body may determine the cosine similarity between the search word and each word in the video text recognition result; the execution subject may also determine the matrix similarity between the search word and each word in the video text recognition result; or, for each word in the video text recognition result, the execution main body may input the search word and that word into a pre-trained character similarity determination model to obtain the similarity between them.
It should be noted that the cosine similarity calculation method and the matrix similarity calculation method are well-known technologies that are widely researched and applied at present, and are not described herein again.
In this embodiment, the execution subject may select a word having the greatest similarity with the search word from the video text recognition result as a target word.
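As an illustration of this selection step, the following sketch ranks the recognition-result words by cosine similarity to the search word. The character-bigram vector representation and all function names are assumptions of the sketch, not part of the disclosure, which equally permits matrix similarity or a pre-trained character similarity model.

```python
# Illustrative sketch: rank recognition-result words by cosine similarity
# to the search word using character-bigram count vectors. The bigram
# representation and function names are assumptions of this sketch.
import math
from collections import Counter

def bigrams(word):
    padded = f"#{word}#"  # pad so one-character words still yield bigrams
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine_similarity(a, b):
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def pick_target_word(search_word, recognized_words):
    return max(recognized_words, key=lambda w: cosine_similarity(search_word, w))

# pick_target_word("milk", ["drinking", "mil", "is", "good"]) -> "mil"
```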
Step 203, acquiring the video frames presenting the target words, and selecting the video frames meeting the preset conditions from the video frames as the target video frames.
In this embodiment, the execution main body may acquire the video frames in which the target word is presented. Since the target word may be presented in more than one video frame, multiple such frames may be acquired.
Then, the execution subject may select a video frame meeting a preset condition from the video frames as the target video frame. The preset condition may be that the sharpness is highest; that is, the execution subject may select the sharpest video frame from the video frames as the target video frame. Specifically, the execution subject may convert each video frame to grayscale, apply the Laplacian operator to the resulting grayscale image, compute the variance of the transformed image, and select the video frame with the largest variance as the target video frame.
If only one frame of video frame has the target word, the execution body may determine that the video frame is the target video frame.
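The sharpness criterion described above can be sketched as follows, assuming the frames are BGR images decoded with OpenCV; the function name is illustrative.

```python
# Minimal sketch of the sharpness criterion, assuming OpenCV-decoded BGR
# frames (numpy arrays); sharpest_frame is an assumed name.
import cv2

def sharpest_frame(frames):
    def laplacian_variance(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # graying processing
        return cv2.Laplacian(gray, cv2.CV_64F).var()    # variance after Laplacian
    return max(frames, key=laplacian_variance)
```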
And step 204, generating a corrected text for outputting based on the search word and the text presented in the target video frame.
In this embodiment, the execution subject may generate and output a corrected text based on the search term and the text presented in the target video frame. The execution main body may replace the target word in the text presented in the target video frame with the search word, and output the corrected text.
The executing body may send the generated corrected text to the terminal device from which the search term originates, and the terminal device may present the received corrected text.
As an example, if the search word is "milk", the text presented in the target video frame is "drinking mil is good for our health", and the target word in that text is "mil", the corrected text generated may be "drinking milk is good for our health".
With the method provided by the above embodiment of the present disclosure, when the text recognition result of a video is retrieved, the correct result can be accurately retrieved and output even if the recognition result contains errors.
In some optional implementations, the executing body may select a video frame meeting a preset condition from the video frames as the target video frame as follows: the execution subject may select the video frame with the highest confidence from the video frames as the target video frame. The confidence of a video frame may be determined as follows: each character in the video frame is recognized by a confidence recognition model to obtain a per-character confidence, and the mean of these character confidences is taken as the confidence of the word. For example, for the word "word", the confidence recognition model outputs a confidence for each character: 0.99 for "w", 0.89 for "o", 0.95 for "r", and 0.95 for "d"; averaging the four confidences gives the confidence of "word" (0.945). The confidence of every other word in the video frame is obtained in the same way, and the confidences of all words are then averaged once more to obtain the confidence of the video frame.
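As a sketch of this confidence criterion, the nested averaging might look as follows; `char_confidences` is a hypothetical stand-in for the confidence recognition model.

```python
# Hedged sketch of the confidence criterion: per-character confidences are
# averaged into word confidences, which are averaged into a frame
# confidence. char_confidences stands in for the confidence recognition
# model mentioned above and is an assumption of this sketch.
from statistics import mean

def frame_confidence(words, char_confidences):
    # char_confidences("word") -> e.g. [0.99, 0.89, 0.95, 0.95] -> 0.945
    return mean(mean(char_confidences(w)) for w in words)

def most_confident_frame(frames_with_words, char_confidences):
    # frames_with_words: list of (frame, words recognized in that frame)
    return max(frames_with_words,
               key=lambda fw: frame_confidence(fw[1], char_confidences))[0]
```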
With further reference to FIG. 3, a flow 300 of one embodiment of an information output method for selecting a word from video text recognition results as a target word is illustrated. The process 300 for selecting words from video text recognition results includes the following steps:
step 301, determining an edit distance between the search word and each word in the video text recognition result as an initial distance.
In this embodiment, an execution subject of the information output method (e.g., a server shown in fig. 1) may determine an edit Distance (Levenshtein Distance) between the search word and each word in the video text recognition result as an initial Distance.
Let A and B be two character strings; the minimum number of character operations required to convert string A into string B is called the edit distance from string A to string B. The character operations include: deleting a character, inserting a character, and rewriting a character as another character. As an example, when string A is abc and string B is abf, converting string A into string B only requires rewriting the character c as the character f, so the edit distance from string A to string B is 1.
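For concreteness, one standard dynamic-programming implementation of this edit distance is sketched below.

```python
# A standard dynamic-programming Levenshtein implementation matching the
# definition above: minimum deletions, insertions, and rewrites to turn
# string a into string b.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # rewrite ca as cb (free if equal)
            ))
        prev = curr
    return prev[-1]

# edit_distance("abc", "abf") == 1, as in the example above.
```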
Step 302, determining a target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result.
In this embodiment, the execution body may determine the target distance between the search word and each word in the video text recognition result based on the initial distance determined in step 301, the character length of the search word, and the character length of each word in the video text recognition result.
Here, for each word in the video text recognition result, the execution main body may compare the character length of the search word with the character length of the word. If the character length of the search word is less than or equal to the character length of the word, the initial distance between the search word and the word may be determined as the target distance between them.
Step 303, adding words in the video text recognition result, the target distance between which and the search word is smaller than a preset target distance threshold value, into the candidate word set.
In this embodiment, the execution subject may add a word in the video text recognition result, in which a target distance between the word and the search word is smaller than a preset target distance threshold (e.g., 2), to the candidate word set.
Step 304, selecting a target word from the candidate word set based on the target distance.
In this embodiment, the execution subject may select a target word from the candidate word set based on a target distance between the search word and each word in the video text recognition result. Here, the execution main body may select a word having a smallest target distance from the search word from the candidate word set as a target word.
In the method provided by the above embodiment of the disclosure, the edit distance between the search word and each word in the video text recognition result is corrected according to the character length of the search word and the character length of each word in the video text recognition result to obtain the target distance, and the target word is then selected from the candidate word set based on the target distance, so that recognition errors that drop characters from a word do not prevent the correct word from being matched.
In some alternative implementations, the executing body may determine the target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result as follows: for each word in the video text recognition result, the execution body may determine whether the character length of the search word is greater than the character length of the word. If it is, the difference between the character length of the search word and the character length of the word may be determined, and the initial distance corresponding to the search word minus that difference may be taken as the target distance between the search word and the word. For example, if the character length of the search word is Lq, the character length of the word is Ld, and the edit distance between them is d0, then the target distance is d1 = d0 - (Lq - Ld).
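This length correction can be sketched as follows, reusing the `edit_distance` helper from the earlier sketch; the function name is an assumption.

```python
# Sketch of the length correction above, reusing edit_distance from the
# dynamic-programming sketch earlier; target_distance is an assumed name.
def target_distance(search_word: str, word: str) -> int:
    d0 = edit_distance(search_word, word)  # initial distance
    lq, ld = len(search_word), len(word)
    # d1 = d0 - (Lq - Ld) when the search word is longer, so that a word
    # that merely lost characters is not over-penalized
    return d0 - (lq - ld) if lq > ld else d0

# target_distance("milk", "mil") == 0: the shared characters match and
# the missing tail character is not counted against "mil".
```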
In some alternative implementations, the executing entity may select the target word from the candidate word set based on the target distance by: the execution subject may select a word having a smallest target distance from the search word from the candidate word set as a first candidate word; then, a target word may be selected from the first word candidates based on the character length of the first word candidate.
In some alternative implementations, the execution subject may select the target word from the first candidate word based on the character length of the first candidate word by: the execution subject may determine whether there is a candidate word having a character length identical to that of the search word in the first candidate word; if a candidate word having the same character length as the search word exists in the first candidate words, the execution main body may select a candidate word having the same character length as the search word from the first candidate words as a target word.
In some alternative implementations, the execution subject may select the target word from the first candidate word based on the character length of the first candidate word by: the execution subject may determine whether there is a candidate word having a character length identical to that of the search word in the first candidate word; if there is no candidate word having the same character length as the search word among the first candidate words, the execution main body may select a candidate word having the longest character length from among the first candidate words as a target word.
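Taken together, the candidate-set construction of step 303 and the tie-breaking rules of these implementations can be sketched as follows, again reusing `target_distance` from the previous sketch; the threshold value echoes the example given in step 303.

```python
# Combined sketch of steps 303-304 with the tie-breaking rules above,
# reusing target_distance from the previous sketch; names are assumed.
def select_target_word(search_word, recognized_words, threshold=2):
    # step 303: keep words whose target distance is below the threshold
    candidates = [w for w in recognized_words
                  if target_distance(search_word, w) < threshold]
    if not candidates:
        return None
    # step 304: narrow to the words at the minimum target distance
    best = min(target_distance(search_word, w) for w in candidates)
    first_candidates = [w for w in candidates
                        if target_distance(search_word, w) == best]
    # prefer a candidate whose length equals the search word's,
    # otherwise fall back to the longest candidate
    same_length = [w for w in first_candidates if len(w) == len(search_word)]
    return same_length[0] if same_length else max(first_candidates, key=len)
```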
With continued reference to FIG. 4, a flow 400 of one embodiment of an information output method to generate corrected text is shown. The process 400 for generating corrected text includes the following steps:
Step 401, comparing the character length of the target word in the text presented in the target video frame with the character length of the search word.
In the present embodiment, an execution subject of the information output method (e.g., a server shown in fig. 1) may compare the character length of the above-described target word in the text presented in the target video frame with the character length of the search word.
As an example, if the search word is "class" and the above-mentioned target word in the text presented in the target video frame is "cas", it may be determined that the character length of the target word "cas" is smaller than the character length of the search word "class".
Here, the target video frame may be a video frame having the highest sharpness among video frames in which the target word is present, or may be a video frame having the highest confidence among video frames in which the target word is present.
If the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, the executing main body may execute step 402.
Step 402, if the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, removing the target character string from the search word to obtain the remaining character string.
In this embodiment, if it is determined in step 401 that the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, the execution main body may remove the target character string from the search word to obtain a remaining character string. The target string may be the target word in text presented in the target video frame.
As an example, if the search word is "Armstrong" and the target word is "strong", the character string "strong" may be removed from the search word "Armstrong", leaving the remaining character string "Arm".
Step 403, determining the edit distance between the remaining character string and each word other than the target word in the text presented in the target video frame.
In this embodiment, the execution subject may determine an edit distance between the remaining character string and a word other than the target word in the text presented in the target video frame.
As an example, the text presented in the target video frame is "Arm strong is a name", the remaining character string is "Arm", and the execution subject may determine the edit distance between the remaining character string "Arm" and "Arm", "is", "a", "name", respectively.
Step 404, if the position in the target video frame of the other word with the minimum edit distance to the remaining character string is adjacent to the position of the target word in the target video frame, deleting that word from the text presented in the target video frame, replacing the target word in the deleted text with the search word, and outputting the corrected text.
In this embodiment, the execution subject may determine whether the position in the target video frame of the other word having the smallest edit distance to the remaining character string is adjacent to the position of the target word in the target video frame. If the two positions are adjacent, the execution main body may delete, from the text presented in the target video frame, the other word having the smallest edit distance to the remaining character string.
For example, if the execution subject determines that the edit distance between the remaining character string "Arm" and "Arm" in the text "Arm strong is a name" presented in the target video frame is the smallest, and the position of "Arm" in that text is adjacent to the position of the target word "strong", the execution subject may delete "Arm" from "Arm strong is a name" to obtain the deleted text "strong is a name".
And then, the execution main body can replace the target word in the deleted text by using the search word to obtain and output the corrected text.
As an example, if the search word is "Armstrong" and the deleted text is "strong is a name", the execution subject may replace the target word "strong" in the deleted text with the search word "Armstrong" to obtain the corrected text "Armstrong is a name".
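The correction path of flow 400 can be sketched end to end as follows; this reuses `edit_distance` from the earlier sketch and is an illustration of steps 401-404 under stated assumptions, not the disclosure's exact procedure.

```python
# Hedged sketch of the flow-400 repair for a target word shorter than the
# search word (the "broken in the middle" case). Reuses edit_distance from
# the earlier sketch; correct_broken_word is an assumed name.
def correct_broken_word(search_word, target_word, frame_words):
    # step 402: strip the target word from the search word
    remaining = search_word.replace(target_word, "", 1)  # "Armstrong" -> "Arm"
    idx = frame_words.index(target_word)                 # first occurrence
    # step 403: edit distance from the remaining string to every other word
    others = [(i, w) for i, w in enumerate(frame_words) if i != idx]
    best_i, _ = min(others, key=lambda iw: edit_distance(remaining, iw[1]))
    # step 404: delete the closest word only if it is adjacent, then substitute
    if abs(best_i - idx) == 1:
        corrected = [w for i, w in enumerate(frame_words) if i != best_i]
        corrected[corrected.index(target_word)] = search_word
        return " ".join(corrected)
    return " ".join(frame_words)  # non-adjacent case is not covered above

# correct_broken_word("Armstrong", "strong", ["Arm", "strong", "is", "a", "name"])
# -> "Armstrong is a name"
```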
The method provided by the above embodiment of the present disclosure provides a text correction manner when the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, so that the corrected text can be generated more accurately.
As shown in FIG. 5, FIG. 5 illustrates a diagram of one embodiment of a correspondence between search words, recognized words, and text error types. In fig. 5, if the search word is "milk" and the recognized word is "mil", the text error type is tail missing. If the search word is "300" and the recognized word is "000", the text error type is head missing. If the search word is "Armstrong" and the recognized word is "Arm strong", the text error type is broken in the middle. If the search word is "class" and the recognized word is "cas", the text error type is middle missing. If the search word is "state" and the recognized word is "stop", the text error type is a character recognition error. If the search word is "wer" and the recognized word is "werthe", the text error type is words stuck together (a space not split). If the search word is "up" and the recognized word is "upl", the text error type is extra characters being recognized.
In some alternative implementations, the executing entity may generate and output the corrected text based on the search term and the text presented in the target video frame by: the execution subject may compare a character length of the target word in a text presented in the target video frame with a character length of the search word; if the character length of the target word in the text presented in the target video frame is greater than or equal to the character length of the search word, the execution main body may replace the target word in the text presented in the target video frame with the search word, so as to obtain a corrected text, and output the corrected text.
As an example, if the search word is "up", the text presented in the target video frame is "He stood upl and went to the window", and the target word in that text is "upl", the executing body may replace "upl" with the search word "up" to obtain the corrected text "He stood up and went to the window".
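This simpler branch reduces to a direct substitution, as in the following sketch; the function name is assumed.

```python
# Minimal sketch of the simpler branch: when the target word is at least
# as long as the search word, substitute it directly in the frame text.
def correct_by_replacement(search_word, target_word, frame_text):
    return frame_text.replace(target_word, search_word)

# correct_by_replacement("up", "upl", "He stood upl and went to the window")
# -> "He stood up and went to the window"
```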
With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an information output apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the information output apparatus 600 of the present embodiment includes: an acquisition unit 601, a first selection unit 602, a second selection unit 603, and an output unit 604. The acquiring unit 601 is configured to acquire a video text recognition result of a video in response to receiving a search word for searching for a text in the video; a first selecting unit 602, configured to select a word from the video text recognition result as a target word based on similarity between the search word and each word in the video text recognition result; the second selecting unit 603 is configured to obtain a video frame in which the target word is present, and select a video frame meeting a preset condition from the video frames as a target video frame; the output unit 604 is configured to generate and output a corrected text based on the search term and the text presented in the target video frame.
In this embodiment, specific processing of the acquisition unit 601, the first selection unit 602, the second selection unit 603, and the output unit 604 of the information output apparatus 600 may refer to step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2.
In some alternative implementations, the first selecting unit 602 may be further configured to select a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result by: the first selecting unit 602 may determine an editing distance between the search word and each word in the video text recognition result as an initial distance; then, a target distance between the search word and each word in the video text recognition result can be determined based on the initial distance, the character length of the search word and the character length of each word in the video text recognition result; then, words in the video text recognition result, the target distance between which and the search word is smaller than a preset target distance threshold value, may be added to the candidate word set; finally, a target word may be selected from the set of candidate words based on the target distance.
In some alternative implementations, the first selecting unit 602 may be further configured to determine the target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result by: for each word in the video text recognition result, the first selecting unit 602 may determine whether the character length of the search word is greater than the character length of the word, and if so, may determine a difference between the character length of the search word and the character length of the word, and may determine a difference between an initial distance corresponding to the search word and the difference as a target distance between the search word and the word.
In some alternative implementations, the first selecting unit 602 may be further configured to select a target word from the candidate word set based on the target distance by: the first selecting unit 602 may select a word having a smallest target distance from the search word from the candidate word set as a first candidate word; then, a target word may be selected from the first word candidates based on the character length of the first word candidate.
In some alternative implementations, the first selecting unit 602 may be further configured to select a target word from the first candidate words based on the character length of the first candidate words by: the first selecting unit 602 may determine whether a candidate word having a character length equal to that of the search word exists in the first candidate word; if the candidate word exists, the candidate word with the same character length as the search word can be selected from the first candidate words to be used as the target word.
In some alternative implementations, the first selecting unit 602 may be further configured to select a target word from the first candidate words based on the character length of the first candidate words by: the first selecting unit 602 may determine whether a candidate word having a character length equal to that of the search word exists in the first candidate word; if not, the candidate word with the longest character length can be selected from the first candidate words as the target word.
In some alternative implementations, the output unit 604 may be further configured to generate and output a corrected text based on the search term and the text presented in the target video frame by: the output unit 604 may compare the character length of the target word in the text presented in the target video frame with the character length of the search word; if the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, the output unit 604 may remove a target character string from the search word to obtain a remaining character string, where the target character string may be the target word in the text presented in the target video frame; then, the editing distance between the residual character string and other words except the target word in the text presented in the target video frame can be determined; if the position of the word having the smallest edit distance to the remaining character string in the target video frame is adjacent to the position of the target word in the target video frame, output section 604 may delete the word having the smallest edit distance to the remaining character string in the text presented in the target video frame, replace the target word in the deleted text with the search word, and output the corrected text.
In some alternative implementations, the output unit 604 may be further configured to generate and output a corrected text based on the search term and the text presented in the target video frame by: the output unit 604 may compare the character length of the target word in the text presented in the target video frame with the character length of the search word; if the character length of the target word in the text presented in the target video frame is greater than or equal to the character length of the search word, the target word in the text presented in the target video frame can be replaced by the search word, and the corrected text is output.
In some optional implementations, the second selecting unit 603 may be further configured to select, as the target video frame, a video frame meeting a preset condition from the video frames as follows: the second selecting unit 603 may select a video frame with the highest confidence level from the video frames as a target video frame.
Referring now to FIG. 7, a block diagram of an electronic device (e.g., the server of FIG. 1) 700 suitable for use in implementing embodiments of the present disclosure is shown. The server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from the storage means 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing means 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the server; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: responding to the received search words searched aiming at the characters in the video, and acquiring a video text recognition result of the video; selecting words from the video text recognition results as target words based on the similarity between the search words and each word in the video text recognition results; acquiring video frames presenting target words, and selecting video frames meeting preset conditions from the video frames as target video frames; and generating and outputting corrected text based on the search words and the text presented in the target video frame.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a first selection unit, a second selection unit, and an output unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquiring unit may also be described as a "unit that acquires a video text recognition result of a video in response to receiving a search word for searching for a text in the video".
The foregoing description is only a description of the preferred embodiments of the disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (12)

1. An information output method, comprising:
responding to a received search word for searching characters in a video, and acquiring a video text recognition result of the video;
selecting words from the video text recognition results as target words based on the similarity of the search words and the words in the video text recognition results;
acquiring video frames presenting the target words, and selecting video frames meeting preset conditions from the video frames as target video frames;
and generating and outputting corrected text based on the search word and the text presented in the target video frame.
2. The method of claim 1, wherein the selecting words from the video text recognition results as target words based on the similarity of the search word to each word in the video text recognition results comprises:
determining an editing distance between the search word and each word in the video text recognition result as an initial distance;
determining a target distance between the search word and each word in the video text recognition result based on the initial distance, the character length of the search word, and the character length of each word in the video text recognition result;
adding words in the video text recognition result, wherein the target distance between the words and the search word is smaller than a preset target distance threshold value, into a candidate word set;
and selecting a target word from the candidate word set based on the target distance.
3. The method of claim 2, wherein determining the target distance between the search term and each word in the video text recognition result based on the initial distance, the character length of the search term, and the character length of each word in the video text recognition result comprises:
and determining whether the character length of the search word is greater than the character length of the word or not for each word in the video text recognition result, if so, determining the difference value between the character length of the search word and the character length of the word, and determining the difference value between the initial distance corresponding to the search word and the difference value as the target distance between the search word and the word.
4. The method of claim 2, wherein the selecting a target word from the set of candidate words based on the target distance comprises:
selecting a word with the minimum target distance to the search word from the candidate word set as a first candidate word;
and selecting a target word from the first candidate words based on the character length of the first candidate words.
5. The method of claim 4, wherein selecting a target word from the first word candidate based on the character length of the first word candidate comprises:
determining whether a candidate word having the same character length as the search word exists in the first candidate words;
and if so, selecting a candidate word with the same character length as the search word from the first candidate word as a target word.
6. The method of claim 4, wherein selecting a target word from the first word candidate based on the character length of the first word candidate comprises:
determining whether a candidate word having the same character length as the search word exists in the first candidate words;
and if the candidate word does not exist, selecting the candidate word with the longest character length from the first candidate words as the target word.
7. The method of claim 1, wherein the generating and outputting a corrected text based on the search word and the text presented in the target video frame comprises:
comparing the character length of the target word in the text presented in the target video frame with the character length of the search word;
if the character length of the target word in the text presented in the target video frame is smaller than the character length of the search word, removing a target character string from the search word to obtain a remaining character string, wherein the target character string is the target word as presented in the target video frame;
determining an edit distance between the remaining character string and each word other than the target word in the text presented in the target video frame;
and if the other word having the minimum edit distance to the remaining character string is adjacent to the target word in the target video frame, deleting that word from the text presented in the target video frame, replacing the target word in the resulting text with the search word, and outputting the corrected text.
8. The method of claim 1, wherein the generating and outputting a corrected text based on the search word and the text presented in the target video frame comprises:
comparing the character length of the target word in the text presented in the target video frame with the character length of the search word;
and if the character length of the target word in the text presented in the target video frame is greater than or equal to the character length of the search word, replacing the target word in the text presented in the target video frame with the search word to obtain the corrected text, and outputting the corrected text.
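A combined sketch of claims 7 and 8, assuming the frame text is an ordered list of words, that "adjacent" means neighboring list positions, and that plain replacement is the fallback when the adjacency check fails (the claim leaves that case open). Joining with spaces is a simplification; for Chinese text one would concatenate directly.

```python
def generate_corrected_text(search_word: str, target: str,
                            frame_words: list[str]) -> str:
    """Claims 7-8: splice the user's search word back into the frame text."""
    words = list(frame_words)
    idx = words.index(target)
    if len(target) >= len(search_word):
        words[idx] = search_word                       # claim 8: direct replacement
        return " ".join(words)
    # Claim 7: the OCR likely split the search word across two words.
    remainder = search_word.replace(target, "", 1)     # strip the target string once
    others = [(i, w) for i, w in enumerate(words) if i != idx]
    if others:
        i_min, _ = min(others, key=lambda iw: edit_distance(remainder, iw[1]))
        if abs(i_min - idx) == 1:                      # the fragment sits next to the target word
            del words[i_min]
            idx = words.index(target)                  # re-locate after deletion
    words[idx] = search_word
    return " ".join(words)
```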
9. The method according to any one of claims 1 to 8, wherein the selecting a video frame meeting a preset condition from the video frames as the target video frame comprises:
and selecting, from the video frames, the video frame with the maximum confidence level as the target video frame.
10. An information output apparatus, comprising:
an acquisition unit configured to acquire, in response to receiving a search word for searching for text in a video, a video text recognition result of the video;
a first selection unit configured to select a word from the video text recognition result as a target word based on the similarity between the search word and each word in the video text recognition result;
a second selection unit configured to acquire video frames presenting the target word and to select a video frame meeting a preset condition from the video frames as a target video frame;
and an output unit configured to generate and output a corrected text based on the search word and the text presented in the target video frame.
11. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-9.
CN202110015895.6A 2021-01-06 2021-01-06 Information output method and device and electronic equipment Active CN112699272B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110015895.6A CN112699272B (en) 2021-01-06 2021-01-06 Information output method and device and electronic equipment
PCT/CN2021/140160 WO2022148239A1 (en) 2021-01-06 2021-12-21 Method and apparatus for information output, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110015895.6A CN112699272B (en) 2021-01-06 2021-01-06 Information output method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112699272A true CN112699272A (en) 2021-04-23
CN112699272B CN112699272B (en) 2024-01-30

Family

ID=75514958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110015895.6A Active CN112699272B (en) 2021-01-06 2021-01-06 Information output method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN112699272B (en)
WO (1) WO2022148239A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022148239A1 (en) * 2021-01-06 2022-07-14 北京有竹居网络技术有限公司 Method and apparatus for information output, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711412A (en) * 2018-12-27 2019-05-03 信雅达系统工程股份有限公司 A kind of optical character identification error correction method based on dictionary
US10402490B1 (en) * 2015-08-14 2019-09-03 Shutterstock, Inc. Edit distance based spellcheck
CN112052352A (en) * 2020-09-07 2020-12-08 北京达佳互联信息技术有限公司 Video sequencing method, device, server and storage medium
CN112115299A (en) * 2020-09-17 2020-12-22 北京百度网讯科技有限公司 Video searching method and device, recommendation method, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5257071B2 (en) * 2006-08-03 2013-08-07 日本電気株式会社 Similarity calculation device and information retrieval device
US9740767B2 (en) * 2013-03-15 2017-08-22 Mapquest, Inc. Systems and methods for analyzing failed and successful search queries
CN107291904A (en) * 2017-06-23 2017-10-24 百度在线网络技术(北京)有限公司 A kind of video searching method and device
CN112699272B (en) * 2021-01-06 2024-01-30 北京有竹居网络技术有限公司 Information output method and device and electronic equipment


Also Published As

Publication number Publication date
WO2022148239A1 (en) 2022-07-14
CN112699272B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN109241286B (en) Method and device for generating text
CN106022349B (en) Method and system for device type determination
CN109858045B (en) Machine translation method and device
CN114861889B (en) Deep learning model training method, target object detection method and device
CN111078825A (en) Structured processing method, structured processing device, computer equipment and medium
CN112712795B (en) Labeling data determining method, labeling data determining device, labeling data determining medium and electronic equipment
CN111582477A (en) Training method and device of neural network model
CN112988753B (en) Data searching method and device
CN114241471B (en) Video text recognition method and device, electronic equipment and readable storage medium
CN112699272B (en) Information output method and device and electronic equipment
KR102382421B1 (en) Method and apparatus for outputting analysis abnormality information in spoken language understanding
CN113553309A (en) Log template determination method and device, electronic equipment and storage medium
CN113761845A (en) Text generation method and device, storage medium and electronic equipment
CN116028868B (en) Equipment fault classification method and device, electronic equipment and readable storage medium
US10810497B2 (en) Supporting generation of a response to an inquiry
US10257055B2 (en) Search for a ticket relevant to a current ticket
CN111626054A (en) New illegal behavior descriptor identification method and device, electronic equipment and storage medium
CN112784596A (en) Method and device for identifying sensitive words
US11157829B2 (en) Method to leverage similarity and hierarchy of documents in NN training
CN111339776B (en) Resume parsing method and device, electronic equipment and computer-readable storage medium
CN112966752B (en) Image matching method and device
US11099977B1 (en) Method, device and computer-readable storage medium for testing bios using menu map obtained by performing image identification on bios interface
CN111597224A (en) Method and device for generating structured information, electronic equipment and storage medium
CN111753548A (en) Information acquisition method and device, computer storage medium and electronic equipment
CN111460971A (en) Video concept detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant