CN110728973A - Video resource output method and server - Google Patents

Video resource output method and server

Info

Publication number
CN110728973A
CN110728973A
Authority
CN
China
Prior art keywords
voice text
text
voice
initial
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911013758.8A
Other languages
Chinese (zh)
Inventor
隋雪芹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Poly Cloud Technology Co Ltd
Original Assignee
Qingdao Poly Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Poly Cloud Technology Co Ltd filed Critical Qingdao Poly Cloud Technology Co Ltd
Priority to CN201911013758.8A priority Critical patent/CN110728973A/en
Publication of CN110728973A publication Critical patent/CN110728973A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/4104 Peripherals receiving signals from specially adapted client devices
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/65 Transmission of management data between client and server
    • H04N21/658 Transmission by the client directed to the server
    • H04N21/6587 Control parameters, e.g. trick play commands, viewpoint selection

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the invention provide a video resource output method and a server. After acquiring, from a display device, the initial voice text that third-party software produced by speech recognition of the user's voice information, the server does not directly output the video resource corresponding to that initial voice text. It first determines, according to the initial voice text and a voice text library, a more accurate target voice text, and then outputs the video resource corresponding to the target voice text to the display device.

Description

Video resource output method and server
Technical Field
The invention relates to the technical field of video playback, and in particular to a video resource output method and a server.
Background
With the development of science and technology, televisions are becoming increasingly intelligent. Besides traditional functions such as video playback and games, a smart television also has network functions, supports cross-platform search across the television, the network, and programs, and allows a user to obtain the video resources he or she needs.
Take the case where a user controls a smart television to play a video resource by voice. The user may input voice information, for example "I want to watch AAA", through the television's microphone. After receiving the voice information, the smart television passes it to third-party software for speech recognition and receives the corresponding text information in return. The smart television then sends the text information to the corresponding server, the server determines the video resource corresponding to the voice information from the text information and returns it, and the smart television displays that video resource to the user.
However, when the third-party software recognizes the voice information "I want to watch AAA", factors such as homophones, pauses, and sentence breaks in Chinese may make the recognized text information wrong. For example, "I want to watch AAA" may be recognized as "I want to watch BBB", so the video resource the server outputs to the smart television corresponds to "I want to watch BBB", which is obviously not what the user wants to watch. The accuracy of the video resources output by the server is therefore low.
Disclosure of Invention
Embodiments of the invention provide a video resource output method and a server, which improve the accuracy of the video resources output by the server.
In a first aspect, an embodiment of the present invention provides a video resource output method, which may include:
acquiring an initial voice text corresponding to voice information sent by a display device;
determining a target voice text corresponding to the initial voice text according to the initial voice text and a voice text library, wherein the voice texts in the voice text library are standard voice texts on which a user click operation has occurred;
and outputting the video resource corresponding to the target voice text to the display device, wherein the video resource is used to respond to the voice information.
In a second aspect, an embodiment of the present invention further provides a server, where the server may include:
the device comprises an acquisition unit, a display unit and a processing unit, wherein the acquisition unit is used for acquiring an initial voice text corresponding to voice information sent by display equipment;
the processing unit is used for determining a target voice text corresponding to the initial voice text according to the initial voice text and the voice text library; the voice texts in the voice text library are standard voice texts subjected to click operation of a user;
and the output unit is used for outputting the video resource corresponding to the target voice text to the display equipment, and the video resource is used for responding to the voice information.
In a third aspect, an embodiment of the present invention further provides a server, where the server may include a memory and a processor;
a memory for storing a computer program;
a processor, configured to read the computer program stored in the memory and execute the video resource output method according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions; when a processor executes these instructions, the video resource output method according to any implementation of the first aspect is performed.
The video resource output method and server provided by the embodiments of the invention differ from the prior art as follows. After the server acquires the initial voice text, sent by the display device, that third-party software obtained through speech recognition of the voice information, it does not directly output the video resource corresponding to the initial voice text. It first determines, according to the initial voice text and the voice text library, a more accurate target voice text corresponding to the initial voice text, and then outputs the video resource corresponding to that target voice text to the display device. Because the voice texts in the voice text library are all standard voice texts on which a user click operation has occurred, low output accuracy caused by speech recognition errors is effectively avoided, and the accuracy of the video resources output by the server is effectively improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 2 is a schematic flowchart of a video resource output method according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of another video resource output method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a hidden state at the current time according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating text acquisition according to an embodiment of the present invention;
fig. 6 is a schematic diagram of another hidden state at the current time according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a server according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of another server according to an embodiment of the present invention.
With the foregoing drawings in mind, certain embodiments of the disclosure are shown above and described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and covers three cases: for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone, where A and B may be singular or plural. In the description of the present invention, the character "/" generally indicates an "or" relationship between the preceding and following associated objects.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present invention. Referring to fig. 1, the scenario includes a display device and a server. When a user controls the display device by voice to output a video resource, the display device uses third-party software to perform speech recognition on the voice information input by the user, obtains the recognized text information, and sends it to the server; the server determines the video resource corresponding to the voice information according to the text information and sends it to the display device, which displays the video resource to the user. However, during speech recognition, factors such as homophones, pauses, and sentence breaks in Chinese may make the recognized text information wrong, so the accuracy of the video resources output by the server is not high.
To improve the accuracy of the video resources output by the server, an embodiment of the invention provides a video resource output method that differs from prior-art video resource output as follows. After the server acquires the initial voice text, sent by the display device, that third-party software obtained through speech recognition of the voice information, it does not directly output the video resource corresponding to the initial voice text. It first determines, according to the initial voice text and a voice text library, a more accurate target voice text corresponding to the initial voice text, and then outputs the video resource corresponding to the target voice text to the display device. Because the voice texts in the voice text library are all standard voice texts on which a user click operation has occurred, speech recognition errors no longer lead to low-accuracy output, and the accuracy of the video resources output by the server is effectively improved.
It can be understood that the voice text library pre-stores a plurality of voice texts obtained through repeated verification; they are all standard voice texts on which a user click operation has occurred. For example, the server may count the voice texts for which the number of past user queries exceeds a certain threshold and for which users normally clicked the video resource the server output. A click indicates that the returned video resource is indeed the one the user wanted to watch, which in turn indicates that the voice text is a standard voice text from which an accurate video resource can be output. Storing such standard voice texts yields a highly accurate voice text library.
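To make that construction concrete, the following is a minimal Python sketch of the statistics described above; the data structures (a query log and a set of clicked texts) and the threshold value are our assumptions, since the patent does not fix them.

```python
from collections import Counter

def build_voice_text_library(query_log, clicked_texts, min_queries=100):
    """query_log: recognized voice texts previously sent to the server.
    clicked_texts: voice texts whose returned video resource the user
    actually clicked. Returns the set of standard voice texts."""
    counts = Counter(query_log)
    library = set()
    for text, n in counts.items():
        # A text counts as "standard" only if it was queried often enough
        # AND the user clicked the video resource output for it.
        if n >= min_queries and text in clicked_texts:
            library.add(text)
    return library
```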
For example, the display device may be a television, a notebook computer, a tablet computer, or another terminal device that has a video playback function and can exchange data with the server. The embodiments of the present invention take a television, a notebook computer, or a tablet computer as examples only and are not limited thereto.
The following describes the technical solution of the present invention and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 2 is a schematic flowchart of a video resource output method according to an embodiment of the present invention. Referring to fig. 2, the method may include:
S201, acquiring an initial voice text corresponding to the voice information sent by the display device.
In an example, after receiving voice information input by the user through a microphone of the display device, or voice information sent by another device, the display device sends the voice information to third-party software for speech recognition and receives the recognition result, which is the initial voice text corresponding to the voice information. The display device then sends the initial voice text to the server, so that the server obtains the initial voice text corresponding to the voice information.
Take the voice information "I want to watch Peppa Pig" as an example. The user may input it through the microphone of the display device. After receiving it, the display device first sends it to third-party software, which performs speech recognition and returns the recognition result, i.e., the initial voice text corresponding to "I want to watch Peppa Pig". The display device then sends the initial voice text to the server, so that the server obtains it.
After the initial voice text corresponding to the voice information sent by the display device is acquired, the target voice text corresponding to the initial voice text can be determined according to the initial voice text and the voice text library, that is, the following S202 is executed:
s202, determining a target voice text corresponding to the initial voice text according to the initial voice text and the voice text library.
The voice texts in the voice text library are all standard voice texts on which a user click operation has occurred.
It can be seen that, unlike prior-art video resource output, in the embodiment of the present invention the server, after obtaining the initial voice text that the display device sent and that third-party software produced by speech recognition, does not directly output the video resource corresponding to the initial voice text; it first determines a more accurate target voice text according to the initial voice text and the voice text library, and then outputs the video resource corresponding to the target voice text to the display device.
For example, after obtaining the initial voice text corresponding to the voice information "I want to watch Peppa Pig", the server may match it against the voice text library, determine the target voice text according to the matching result, and then output the video resource corresponding to the target voice text to the display device.
S203, outputting the video resource corresponding to the target voice text to the display device.
The video resource is used to respond to the voice information input by the user.
In this way, after the server acquires the initial voice text sent by the display device and obtained through third-party speech recognition, it does not directly output the video resource corresponding to the initial voice text, but first determines the more accurate target voice text according to the initial voice text and the voice text library, and then outputs the video resource corresponding to the target voice text to the display device.
Based on the embodiment shown in fig. 2, in S202 the target voice text corresponding to the initial voice text may be determined according to the edit distance, or according to the edit distance together with a text error correction model. For example, fig. 3 is a schematic flowchart of another video resource output method provided by an embodiment of the present invention; referring to fig. 3, the method may further include:
s301, searching an initial voice text in a voice text library.
S302, if the initial voice text is found in the voice text library, determining the initial voice text as the target voice text.
For example, if the recognition result that the display device receives from the third-party software for the voice information "I want to watch Peppa Pig" is indeed "I want to watch Peppa Pig", i.e., the initial voice text is "I want to watch Peppa Pig", that initial voice text is sent to the corresponding server. The server searches the voice text library for "I want to watch Peppa Pig"; if it is found, the initial voice text returned by the third-party software is an accurate voice text, and this accurate initial voice text is the target voice text.
Conversely, if the recognition result received for the voice information "I want to watch Peppa Pig" is "I want to watch pigskin qin", i.e., the initial voice text is "I want to watch pigskin qin", that initial voice text is likewise sent to the corresponding server. The server searches the voice text library for "I want to watch pigskin qin"; after determining that it is not found, which indicates that the third-party software returned a voice text with a speech recognition error, the following S303 to S307 are performed.
S303, if the initial voice text is not found in the voice text library, calculating the edit distance between the initial voice text and each voice text in the voice text library.
When computing the edit distance between the initial voice text "I want to watch pigskin qin" and each voice text in the voice text library, the function edit(i, j) may be used, where edit(i, j) denotes the edit distance between the length-i prefix of the first string and the length-j prefix of the second string:

if i = 0 and j = 0, then edit(i, j) = 0;
if i = 0 and j > 0, then edit(i, j) = j;
if i > 0 and j = 0, then edit(i, j) = i;
if i ≥ 1 and j ≥ 1, then edit(i, j) = min{edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j)},

where f(i, j) = 1 when the i-th character of the first string differs from the j-th character of the second string, and f(i, j) = 0 otherwise.
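As an illustration, a minimal Python sketch of this recurrence in its usual dynamic-programming form (function and variable names are ours, not the patent's):

```python
def edit_distance(a: str, b: str) -> int:
    """Tabulated form of the edit(i, j) recurrence above."""
    m, n = len(a), len(b)
    # edit[i][j] = edit distance between a[:i] and b[:j]
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i                      # edit(i, 0) = i
    for j in range(n + 1):
        edit[0][j] = j                      # edit(0, j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f = 0 if a[i - 1] == b[j - 1] else 1       # f(i, j)
            edit[i][j] = min(edit[i - 1][j] + 1,       # deletion
                             edit[i][j - 1] + 1,       # insertion
                             edit[i - 1][j - 1] + f)   # substitution
    return edit[m][n]
```

Assuming the running example's underlying Chinese strings are 小猪皮琴 and 小猪佩奇, only the last two characters differ, so edit_distance would return 2.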
It can be understood that when these edit distances are computed with edit(i, j), the full initial voice text "I want to watch pigskin qin" can be compared directly against each voice text in the library. To improve computational efficiency, however, the initial voice text may first be processed to remove the common phrase "I want to watch", leaving the keyword voice text "pigskin qin"; then only the edit distance between the keyword text and each voice text in the library needs to be computed.
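A sketch of that preprocessing step; the prefix list here is assumed for illustration, since the patent does not enumerate the common phrases:

```python
COMMON_PREFIXES = ("i want to watch", "i want to see", "play")  # assumed list

def extract_keyword(text: str) -> str:
    """Strip a leading common phrase so that only the keyword part of
    the voice text is compared against the library."""
    for prefix in COMMON_PREFIXES:
        if text.startswith(prefix):
            return text[len(prefix):].strip()
    return text
```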
After the edit distance between the initial voice text "I want to watch pigskin qin" and each voice text in the library has been computed with edit(i, j), each resulting edit distance can be compared against a preset threshold.
And S304, judging the relation between the editing distance and a preset threshold value.
The preset threshold may be 2 or 3 and may be set according to actual needs; the embodiments of the present invention do not further limit its value.
S305, if the edit distance between a first voice text in the voice text library and the initial voice text is smaller than the preset threshold, determining the first voice text as the target voice text corresponding to the initial voice text.
Each computed edit distance is compared with the preset threshold. If the edit distance between a first voice text and the initial voice text is below the threshold, the voice information was misrecognized as the initial voice text during third-party speech recognition and the correct voice text is the first voice text, so the first voice text can be determined as the target voice text corresponding to the initial voice text.
Optionally, when determining the target voice text from the edit distances: if only one first voice text in the library has an edit distance to the initial voice text below the preset threshold, that first voice text is directly determined as the target voice text; if several first voice texts are below the threshold, their edit distances to the initial voice text are compared first, and the second voice text with the smallest edit distance to the initial voice text is determined as the target voice text.
Further, if several second voice texts share the minimum edit distance to the initial voice text, any one of them may be selected and determined as the target voice text. A sketch of this candidate selection follows.
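Putting S304, S305, and the tie-breaking rule together, reusing edit_distance and extract_keyword from the sketches above (names and the default threshold are ours):

```python
def pick_target_text(initial_text, library, threshold=2):
    """Return the target voice text, or None if every edit distance is
    at or above the threshold (fall through to error correction)."""
    keyword = extract_keyword(initial_text)
    # First voice texts: library entries whose distance is below the threshold.
    scored = [(edit_distance(keyword, t), t) for t in library]
    below = [(d, t) for d, t in scored if d < threshold]
    if not below:
        return None                    # proceed to S306 (error correction)
    best = min(d for d, _ in below)
    # Several second voice texts may tie at the minimum distance;
    # any one of them may be chosen.
    return next(t for d, t in below if d == best)
```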
If the edit distance between the initial voice text "I want to watch pigskin qin" and every voice text in the voice text library is greater than the preset threshold, an accurate target voice text cannot be obtained by edit distance alone. The accurate target voice text corresponding to "I want to watch pigskin qin" can then be determined through the text error correction model, i.e., S306 to S307 below are executed.
S306, if the edit distance between each voice text in the voice text library and the initial voice text is greater than the preset threshold, performing error correction processing on the initial voice text through a text error correction model to obtain an error-corrected voice text.
The text error correction model is trained by encoding sample voice texts into a plurality of semantic features. Note that in the prior art, when a text error correction model is trained on a sample voice text, the sample is encoded into a single semantic feature c, which is then decoded. Because that single semantic feature c must contain all the information of the original sequence, it cannot hold enough information when the sentence is long, which lowers the error correction accuracy. Unlike the prior art, in the embodiment of the present invention the sample voice text is encoded into a plurality of semantic features, which are then decoded. This effectively avoids the accuracy loss caused by one semantic feature c being unable to store enough information and improves the accuracy of the text error correction model to a certain extent.
When the initial voice text "I want to watch pigskin qin" is corrected through the text error correction model, it is first encoded to obtain an input sequence x = {x1, x2, x3, x4, x5, x6, x7}, one element per character of the text. The model is divided into three parts: input, output, and intermediate hidden states. Its biggest difference from a traditional neural network is that each hidden state receives the hidden state of the previous moment, giving the encoded vectors: h1 = f(Ux1 + Wh0 + b); h2 = f(Ux2 + Wh1 + b); h3 = f(Ux3 + Wh2 + b); h4 = f(Ux4 + Wh3 + b); h5 = f(Ux5 + Wh4 + b); h6 = f(Ux6 + Wh5 + b); h7 = f(Ux7 + Wh6 + b). Here h0 is a preset initial vector, and U, W, and b are obtained in advance by training on samples; U, W, and b capture the relationship between the current vector and the accurate text, so error correction of the current vector can be realized based on them. It can be seen that computing the hidden state hn at the current moment requires two inputs: the current input xn and the previous hidden state hn-1, as shown in fig. 4, which is a schematic diagram of the hidden state at the current moment according to an embodiment of the present invention.
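In code, the encoder recursion might look like the following sketch; tanh is our assumption for f, and the shapes of U, W, b, and h0 are left to the caller, since the patent fixes neither:

```python
import numpy as np

def encode(xs, U, W, b, h0):
    """Compute h_t = f(U x_t + W h_{t-1} + b) for t = 1..len(xs);
    f is taken to be tanh here (an assumption)."""
    hs, h = [], h0
    for x in xs:                         # xs: character vectors x1..x7
        h = np.tanh(U @ x + W @ h + b)   # current input + previous hidden state
        hs.append(h)
    return hs                            # hidden states h1..h7
```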
Since the encoded vectors obtained based on U, W, and b are accurate, decoding them, i.e., translating the vectors back into text, also yields an accurate result. For the translation of vectors into text, refer to fig. 5, a schematic diagram of text acquisition provided in an embodiment of the present invention. Assume that during encoding the input sequence is encoded into seven semantic features c1 to c7. During decoding, each decoder hidden state h'i is obtained from the previous hidden state h'i-1 and the semantic feature ci, with h'0 = 0, and each output is computed as yi = g(ci, {y1, ..., yi-1}). This yields y1 to y7, from which the error-corrected voice text "I want to watch Peppa Pig" is obtained for the initial voice text "I want to watch pigskin qin".
When the input sequence is encoded into multiple semantic features, each semantic feature ci is a weighted sum of the encoder hidden states. For a three-character input, for example: c1 = a11·h1 + a12·h2 + a13·h3; c2 = a21·h1 + a22·h2 + a23·h3; c3 = a31·h1 + a32·h2 + a33·h3, where aij measures how much the j-th hidden state hj contributes to the i-th semantic feature, as shown in fig. 6, a schematic diagram of another hidden state at the current moment according to an embodiment of the present invention.
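A sketch of assembling the semantic features from the hidden states, given an attention weight matrix A with A[i][j] = aij (how the weights themselves are computed is not spelled out in this section):

```python
def semantic_features(hs, A):
    """c_i = sum_j a_ij * h_j over the encoder hidden states hs."""
    return [sum(a_ij * h_j for a_ij, h_j in zip(row, hs)) for row in A]
```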
S307, determining the corrected voice text as a target voice text corresponding to the initial voice text.
After the error-corrected voice text is obtained from the initial voice text "I want to watch pigskin qin" through the text error correction model, the corrected voice text "I want to watch Peppa Pig" is determined as the final target voice text, and the video resource corresponding to "I want to watch Peppa Pig" is output to the display device. This effectively avoids the low output accuracy caused by speech recognition errors and effectively improves the accuracy of the video resources output by the server.
In addition, it can be understood that after the corrected voice text is obtained through the text error correction model, it can further be stored in the voice text library to update the library. If a video resource corresponding to the same initial voice text needs to be output later, the updated library is searched directly, without running the text error correction model again, and the video resource corresponding to the corrected voice text is output directly. This guarantees the accuracy of the output video resource while further improving the server's output efficiency.
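A sketch of the full resolution path with that update, combining the library lookup, the edit distance match, and the error correction stage; model.correct is a hypothetical API standing in for the text error correction model:

```python
def resolve_target_text(initial_text, library, model, threshold=2):
    """S301-S307 end to end, under the helper names sketched above."""
    if initial_text in library:                                   # S301-S302
        return initial_text
    target = pick_target_text(initial_text, library, threshold)  # S303-S305
    if target is not None:
        return target
    corrected = model.correct(initial_text)      # S306-S307 (assumed API)
    library.add(corrected)                       # update the voice text library
    return corrected
```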
Fig. 7 is a schematic structural diagram of a server 70 according to an embodiment of the present invention, for example, please refer to fig. 7, where the server 70 may include:
an obtaining unit 701, configured to obtain an initial voice text corresponding to the voice information sent by the display device.
A processing unit 702, configured to determine, according to the initial voice text and the voice text library, a target voice text corresponding to the initial voice text; the voice texts in the voice text library are all standard voice texts on which a user click operation has occurred.
An output unit 703 is configured to output, to the display device, a video resource corresponding to the target voice text, where the video resource is used for responding to the voice information.
Optionally, the processing unit 702 is specifically configured to search for the initial voice text in the voice text library, and if the initial voice text is found in the voice text library, determine the initial voice text as the target voice text.
Optionally, the processing unit 702 is further configured to calculate, if the initial voice text is not found in the voice text library, the edit distance between the initial voice text and each voice text in the voice text library, and to determine the target voice text corresponding to the initial voice text according to the calculated edit distances.
Optionally, the processing unit 702 is specifically configured to determine, if the edit distance between a first voice text in the voice text library and the initial voice text is smaller than a preset threshold, the first voice text as the target voice text corresponding to the initial voice text.
Optionally, the processing unit 702 is specifically configured to determine, from a plurality of first voice texts, the second voice text with the smallest edit distance to the initial voice text, and to determine the second voice text as the target voice text corresponding to the initial voice text.
Optionally, the processing unit 702 is specifically configured to perform, if the edit distance between each voice text in the voice text library and the initial voice text is greater than the preset threshold, error correction processing on the initial voice text through a text error correction model to obtain an error-corrected voice text, where the text error correction model is trained by encoding sample voice texts into a plurality of semantic features, and to determine the corrected voice text as the target voice text corresponding to the initial voice text.
Optionally, the processing unit 702 is further configured to store the corrected voice text in a voice text library.
The server 70 shown in this embodiment of the present invention may execute the technical solution of the video resource output method in any of the embodiments shown in the above figures; its implementation principle and beneficial effects are similar to those of the video resource output method and are not repeated here.
Fig. 8 is a schematic structural diagram of another server 80 according to an embodiment of the present invention, for example, please refer to fig. 8, where the server 80 may include a memory 801 and a processor 802.
A memory 801 for storing a computer program.
The processor 802 is configured to read the computer program stored in the memory 801 and execute the output method of the video resource according to any of the embodiments described above according to the computer program in the memory 801.
Alternatively, the memory 801 may be separate or integrated with the processor 802. When the memory 801 is a device independent of the processor 802, the server 80 may further include a bus for connecting the memory 801 and the processor 802.
Optionally, this embodiment further includes: a communication interface that may be coupled to the processor 802 via a bus. The processor 802 may control the communication interface to implement the receiving and transmitting functions of the server described above.
The server 80 shown in this embodiment of the present invention may execute the technical solution of the video resource output method in any of the embodiments shown in the above figures; its implementation principle and beneficial effects are similar to those of the video resource output method and are not repeated here.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions; when a processor executes these instructions, the video resource output method of any of the above embodiments is implemented.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present invention.
It should be understood that the processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM and may further comprise a non-volatile memory (NVM), such as at least one disk memory; it may also be a USB disk, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present invention are not limited to only one bus or one type of bus.
The computer-readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A video resource output method, comprising:
acquiring an initial voice text corresponding to voice information sent by a display device;
determining a target voice text corresponding to the initial voice text according to the initial voice text and a voice text library, wherein the voice texts in the voice text library are standard voice texts on which a user click operation has occurred;
and outputting a video resource corresponding to the target voice text to the display device, wherein the video resource is used to respond to the voice information.
2. The method of claim 1, wherein determining the target voice text corresponding to the initial voice text according to the initial voice text and the voice text library comprises:
searching for the initial voice text in the voice text library;
and if the initial voice text is found in the voice text library, determining the initial voice text as the target voice text.
3. The method of claim 2, further comprising:
if the initial voice text is not found in the voice text library, calculating an edit distance between the initial voice text and each voice text in the voice text library;
and determining the target voice text corresponding to the initial voice text according to the calculation result of the edit distance.
4. The method according to claim 3, wherein determining the target voice text corresponding to the initial voice text according to the calculation result of the edit distance comprises:
if the edit distance between a first voice text in the voice text library and the initial voice text is smaller than a preset threshold, determining the first voice text as the target voice text corresponding to the initial voice text.
5. The method of claim 4, wherein, if there are a plurality of first voice texts, determining the first voice text as the target voice text corresponding to the initial voice text comprises:
determining, among the plurality of first voice texts, a second voice text with the smallest edit distance to the initial voice text;
and determining the second voice text as the target voice text corresponding to the initial voice text.
6. The method according to claim 3, wherein determining the target voice text corresponding to the initial voice text according to the calculation result of the edit distance comprises:
if the edit distance between each voice text in the voice text library and the initial voice text is greater than a preset threshold, performing error correction processing on the initial voice text through a text error correction model to obtain an error-corrected voice text, wherein the text error correction model is trained by encoding sample voice texts into a plurality of semantic features;
and determining the corrected voice text as the target voice text corresponding to the initial voice text.
7. The method of claim 6, further comprising:
and storing the corrected voice text in the voice text library.
8. A server, comprising:
an acquisition unit, configured to acquire an initial voice text corresponding to voice information sent by a display device;
a processing unit, configured to determine a target voice text corresponding to the initial voice text according to the initial voice text and a voice text library, wherein the voice texts in the voice text library are standard voice texts on which a user click operation has occurred;
and an output unit, configured to output a video resource corresponding to the target voice text to the display device, wherein the video resource is used to respond to the voice information.
9. The server according to claim 8,
the processing unit is specifically configured to search for the initial voice text in the voice text library, and if the initial voice text is found in the voice text library, determine the initial voice text as the target voice text.
10. The server according to claim 9,
the processing unit is further configured to calculate an edit distance between the initial voice text and each voice text in the voice text library if the initial voice text is not found in the voice text library, and determine the target voice text corresponding to the initial voice text according to the calculation result of the edit distance.
CN201911013758.8A 2019-10-23 2019-10-23 Video resource output method and server Withdrawn CN110728973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911013758.8A CN110728973A (en) 2019-10-23 2019-10-23 Video resource output method and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911013758.8A CN110728973A (en) 2019-10-23 2019-10-23 Video resource output method and server

Publications (1)

Publication Number Publication Date
CN110728973A (en) 2020-01-24

Family

ID=69222961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911013758.8A Withdrawn CN110728973A (en) 2019-10-23 2019-10-23 Video resource output method and server

Country Status (1)

Country Link
CN (1) CN110728973A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462754A (en) * 2020-04-16 2020-07-28 深圳航天科创实业有限公司 Method for establishing dispatching control voice recognition model of power system


Similar Documents

Publication Publication Date Title
CN110232183B (en) Keyword extraction model training method, keyword extraction device and storage medium
US11004448B2 (en) Method and device for recognizing text segmentation position
KR102254612B1 (en) method and device for retelling text, server and storage medium
KR20210061141A (en) Method and apparatus for processimg natural languages
CN109979450B (en) Information processing method and device and electronic equipment
CN110718226A (en) Speech recognition result processing method and device, electronic equipment and medium
JP2020004382A (en) Method and device for voice interaction
CN116884391B (en) Multimode fusion audio generation method and device based on diffusion model
CN109859747B (en) Voice interaction method, device and storage medium
CN112016275A (en) Intelligent error correction method and system for voice recognition text and electronic equipment
CN111243571A (en) Text processing method, device and equipment and computer readable storage medium
CN112331229A (en) Voice detection method, device, medium and computing equipment
CN111326144B (en) Voice data processing method, device, medium and computing equipment
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN115312034A (en) Method, device and equipment for processing voice signal based on automaton and dictionary tree
CN111523532A (en) Method for correcting OCR character recognition error and terminal equipment
CN109829040B (en) Intelligent conversation method and device
WO2020052060A1 (en) Method and apparatus for generating correction statement
CN110728973A (en) Video resource output method and server
US11977975B2 (en) Learning method using machine learning to generate correct sentences, extraction method, and information processing apparatus
CN112527967A (en) Text matching method, device, terminal and storage medium
CN112966479A (en) Language model-based auxiliary writing method, device and system
KR20210042707A (en) Method and apparatus for processing speech
CN111428508A (en) Style customizable text generation
CN109036379B (en) Speech recognition method, apparatus and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200124