CN111966839B - Data processing method, device, electronic equipment and computer storage medium - Google Patents
- Publication number: CN111966839B (application CN202010826912.XA)
- Authority: CN (China)
- Prior art keywords: audio, data, picture, sub, text information
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/433—Information retrieval; query formulation using audio data
- G06F16/434—Information retrieval; query formulation using image data, e.g. images, photos, pictures taken by a user
- G06F16/44—Information retrieval of multimedia data; browsing; visualisation therefor
- G06F18/22—Pattern recognition; matching criteria, e.g. proximity measures
- G06F40/30—Handling natural language data; semantic analysis
- G06N20/00—Machine learning
- G06N3/045—Neural networks; combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/084—Neural network learning methods; backpropagation, e.g. using gradient descent
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science; Theoretical Computer Science; Physics & Mathematics; Data Mining & Analysis; General Engineering & Computer Science; General Physics & Mathematics; Artificial Intelligence; Mathematical Physics; Software Systems; Evolutionary Computation; Life Sciences & Earth Sciences; Computing Systems; Computational Linguistics; General Health & Medical Sciences; Health & Medical Sciences; Multimedia; Biomedical Technology; Databases & Information Systems; Molecular Biology; Biophysics; Computer Vision & Pattern Recognition; Audiology, Speech & Language Pathology; Medical Informatics; Bioinformatics & Cheminformatics; Bioinformatics & Computational Biology; Evolutionary Biology; Information Retrieval, Db Structures And Fs Structures Therefor; Television Signal Processing For Recording
Abstract
The embodiment of the invention provides a data processing method, an apparatus, an electronic device, and a computer storage medium. An audio file and a picture file for making an audio picture are acquired, and each piece of audio data is divided according to the audio text information it contains, to obtain a plurality of pieces of audio sub-data. If the similarity between the image text information in a picture and the audio text information in a piece of audio sub-data is not less than a preset similarity threshold, the correspondence between that picture and that audio sub-data is recorded. The audio file and the picture file are then made into an audio picture using the recorded correspondences, so that the content of the audio part is automatically matched with the content of the non-audio part and the efficiency of making audio pictures is improved.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a computer storage medium.
Background
Reading helps people understand the world, acquire knowledge, cultivate good hobbies and interests, and improve their thinking ability. The traditional reading mode acquires information through visual browsing alone. To make reading more engaging, a new reading mode has emerged in which information is acquired by combining hearing with vision, for example through audio books and children's or adult picture books. When hearing is the primary channel and vision the secondary one, the reader's imagination is better stimulated.
When information is acquired by combining hearing and vision, the information needs to comprise an audio part and a non-audio part. The non-audio part can be words, images, or pictures, and the audio part can explain the content of the non-audio part. To help the reader understand the content of the book, the content of the audio part and the content of the non-audio part need to be in one-to-one correspondence, so that whenever any part of the non-audio content is browsed, the corresponding audio is played automatically. In the prior art, the contents of the audio part and the non-audio part in audio books and in children's or adult picture books need to be matched manually; the whole process involves complicated operations, is error-prone, and is therefore inefficient.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a data processing method, apparatus, electronic device, and computer storage medium, so as to automatically match the content of an audio part with the content of a non-audio part. The specific technical solutions are as follows:
In a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
dividing each piece of audio data according to the audio text information in it, to obtain a plurality of pieces of audio sub-data;
for any picture, respectively calculating the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data;
if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not less than a preset similarity threshold, recording the correspondence between that picture and that audio sub-data;
and making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
Optionally, dividing each piece of audio data according to its audio text information to obtain a plurality of pieces of audio sub-data includes:
recognizing the audio text information in the audio data by using a speech recognition technology, to obtain text data in the audio data;
and performing semantic-relation recognition on the text data, and dividing each piece of audio data according to the semantic-relation recognition result, to obtain a plurality of pieces of audio sub-data.
Optionally, the text data in the audio data includes a time stamp for each character; after the step of recognizing the audio text information in the audio data by using the speech recognition technology to obtain the text data in the audio data, the method further comprises:
reading each character of the text data in sequence, following the order of the characters in the text data;
calculating the difference between the time stamps of adjacent characters, and if the difference is not less than a preset difference threshold, dividing the adjacent characters into two different pieces of audio sub-data, wherein the character with the earlier time stamp is assigned to the former piece and the character with the later time stamp to the latter piece;
and if the difference is less than the preset difference threshold, dividing the adjacent characters into the same piece of audio sub-data.
Optionally, for any picture, calculating the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data includes:
identifying the image text information in the picture based on text features in the picture;
sequentially inputting the recognized image text information and each piece of audio sub-data into a pre-trained matching model, to obtain the matching confidence between the image text information and the audio text information in each piece of audio sub-data;
in this case, if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not less than the preset similarity threshold, recording the correspondence between the picture and the audio sub-data includes:
when the matching confidence between the image text information and any piece of audio sub-data is not less than a preset first confidence threshold, recording the correspondence between the picture and that audio sub-data.
Optionally, after identifying the image text information in any picture, the method further includes:
if the image text information corresponding to a plurality of pictures is the same, identifying the image text data in any such picture in the picture file based on image features;
sequentially matching the identified image text data with each piece of audio sub-data, to obtain the matching confidence between the image text data and each piece of audio sub-data;
in this case, recording the correspondence between the picture and the audio sub-data when the matching confidence between the image text information and any piece of audio sub-data is not less than the preset first confidence threshold includes:
when the matching confidence of the matching result between the image text data and any piece of audio sub-data reaches a preset second confidence threshold, recording the correspondence between the picture and that audio sub-data.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, the apparatus including:
an acquisition module, configured to acquire an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
The dividing module is used for dividing each audio data according to the audio text information in each audio data to obtain a plurality of audio sub-data;
the calculating module is used for calculating, for any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data;
the recording module is used for recording the corresponding relation between the picture and the audio sub-data if the similarity between the image text information in any picture and the audio text information in any audio sub-data is not smaller than a preset similarity threshold value;
and the making module is used for making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
Optionally, the dividing module includes:
the audio text information recognition sub-module is used for recognizing the audio text information in the audio data by using a speech recognition technology, to obtain text data in the audio data;
the first dividing sub-module is used for carrying out semantic relation recognition on the text data, and dividing each audio data according to semantic relation recognition results to obtain a plurality of audio sub-data.
Optionally, the text data in the audio data includes a time stamp of each character; the apparatus further comprises:
the reading sub-module is used for sequentially reading each character in the text data according to the sequence of the characters in the text data;
the difference calculation sub-module is used for calculating the difference between the time stamps of adjacent characters and, if the difference is not less than a preset difference threshold, dividing the adjacent characters into two different pieces of audio sub-data, wherein the character with the earlier time stamp is assigned to the former piece and the character with the later time stamp to the latter piece;
and the second dividing sub-module is used for dividing the adjacent characters into the same audio sub-data if the difference value is smaller than the preset difference value threshold value.
Optionally, the computing module includes:
the first image text information identification sub-module is used for identifying the image text information in any picture based on the text features in the picture;
the first matching sub-module is used for inputting the recognized image text information and each audio sub-data into a pre-trained matching model in sequence to obtain the matching confidence degree of the image text information and the audio text information in each audio sub-data;
The recording module is specifically used for:
when the matching confidence between the image text information and any piece of audio sub-data is not less than a preset first confidence threshold, recording the correspondence between the picture and that audio sub-data.
Optionally, the apparatus further includes:
the second image text information identification sub-module is used for identifying the image text data in any picture in the picture file based on image features if the image text information corresponding to the plurality of pictures is the same;
the second matching sub-module is used for sequentially matching the recognized image text data with each audio sub-data to obtain the matching confidence of the image text data and each audio sub-data;
the recording module is specifically used for:
when the matching confidence of the matching result of the image text data and any audio sub-data reaches a preset second confidence threshold, recording the corresponding relation between the picture and the audio sub-data.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
The memory is used for storing a computer program;
the processor is configured to implement the method according to any one of the first aspect when executing the computer program stored on the memory.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having a computer program stored therein, which when executed by a processor implements the method according to any of the first aspects.
With the data processing method, apparatus, electronic device, and computer storage medium provided by the embodiments of the invention, an audio file and a picture file for making an audio picture can be acquired, wherein the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information. Each piece of audio data is divided according to its audio text information to obtain a plurality of pieces of audio sub-data. For any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data is calculated; if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not less than a preset similarity threshold, the correspondence between the picture and the audio sub-data is recorded. The audio file and the picture file are then made into an audio picture by using the correspondence between the pictures and the audio sub-data.
In the embodiment of the invention, audio data can be divided automatically into a plurality of pieces of audio sub-data; for any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data is calculated, and the picture is associated with audio sub-data according to the similarity. By applying the embodiment of the invention, each picture is automatically associated with the audio sub-data of the audio data, so that the content of the audio part is automatically matched with the content of the non-audio part. Making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data improves the efficiency of making audio pictures. Of course, it is not necessary for any one product or method practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a first data processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a second data processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a third data processing method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a fourth data processing method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a fifth data processing method according to an embodiment of the present invention;
FIG. 6 is a flowchart of a sixth data processing method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a first data processing apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a second data processing apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a third data processing apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a fourth data processing apparatus according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to solve the problem that in the prior art, manual matching is required to match the content of the audio part with the content of the non-audio part in the audio picture, the embodiment of the invention provides a data processing method, a data processing device, an electronic device, a computer storage medium and a computer program product containing instructions.
The following first describes a data processing method provided in an embodiment of the present invention. The method is applied to electronic equipment, and in particular, the electronic equipment can be any electronic equipment which can provide data processing services, such as a personal computer, a server and the like. The data processing method provided by the embodiment of the invention can be realized by at least one of software, a hardware circuit and a logic circuit arranged in the electronic equipment.
Referring to fig. 1, fig. 1 is a flowchart of a first data processing method according to an embodiment of the present invention; the method comprises the following steps:
s101, acquiring an audio file and a picture file for making an audio picture. The audio file comprises at least one audio data, and the picture file comprises a picture; each of the audio data includes audio text information, and each of the pictures includes image text information.
S102, dividing each audio data according to the audio text information in each audio data to obtain a plurality of audio sub-data.
S103, for any picture, calculating the similarity between the image text information in the picture and the audio text information in each audio sub-data.
S104, if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not less than a preset similarity threshold, recording the correspondence between the picture and the audio sub-data.
S105, making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
According to the embodiment of the invention, audio data can be divided automatically into a plurality of pieces of audio sub-data. For any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data is calculated, and the picture is associated with audio sub-data according to the similarity; in this way each picture is automatically associated with audio sub-data of the audio data, and the content of the audio part is automatically matched with the content of the non-audio part. Making the audio file and the picture file into an audio picture by using this correspondence improves the efficiency of making audio pictures.
The audio picture can be an audio book, a children's picture book, an adult picture book, or the like, and comprises the content of an audio part and the content of a non-audio part: the content of the audio part can further explain the content of the non-audio part, and the content of the non-audio part can show the scene expressed by the content of the audio part. For example, a children's picture book includes a plurality of pictures, and when one picture is browsed, the content of the audio part corresponding to that picture can be played, achieving a better browsing and reading effect. To make an audio picture, an audio file for producing the content of the audio part and a picture file for producing the content of the non-audio part are required; the audio file comprises at least one piece of audio data, and the picture file comprises at least one picture. Each piece of audio data comprises audio text information, and each picture comprises image text information. The audio text information represents the text expressed by the audio data: for example, if the audio data says "the weather is good today", its audio text information is "the weather is good today". The image text information represents the text included in the picture: for example, if a picture contains the word "weather", its image text information is "weather".
After the audio text information of the audio data is obtained, each piece of audio data can be divided in a preset manner to obtain a plurality of pieces of audio sub-data, where each piece of audio sub-data is a part of the audio data. For example, if the audio data corresponds to a text consisting of three paragraphs, each containing a plurality of sentences, a piece of audio sub-data may be one paragraph or a single sentence; if the audio data corresponds to a text segment of three sentences, a piece of audio sub-data may be one sentence.
For example, the audio data may be divided according to the semantic relations in the audio text information; the order of the characters in the audio text information can be identified, and the audio data divided accordingly. Suppose the audio data is the recording of the primary-school text "Shadow": "the shadow is in front, the shadow is behind, the shadow often follows me, just like a little black dog". Dividing it according to its semantic relations yields four pieces of audio sub-data: "the shadow is in front", "the shadow is behind", "the shadow often follows me", and "just like a little black dog".
In order to correlate the audio sub-data with the pictures, for any picture, similarity between the image text information in the picture and the audio text information in each audio sub-data is calculated. It is understood that a picture may have a correspondence with one audio sub-data or may have a correspondence with a plurality of audio sub-data. For any picture, whether the picture and each audio sub-data have a corresponding relation or not can be determined according to the calculated similarity, and then the audio file and the picture file can be manufactured into an audio picture by utilizing the corresponding relation between the picture and the audio sub-data. For example, for any picture, when the picture has a corresponding relationship with the plurality of audio sub-data, the sequence of each audio sub-data corresponding to the picture may be determined according to the sequence of each audio sub-data in the audio file.
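As a concrete illustration of this threshold-based matching, the following is a minimal sketch. It is illustrative only: text_similarity stands in for whatever similarity measure (or matching model) an implementation uses, and the names and the 0.8 threshold are assumptions, not taken from the patent.

```python
# Minimal sketch of the matching step; names and threshold are illustrative.

SIMILARITY_THRESHOLD = 0.8  # the "preset similarity threshold" (assumed value)

def match_pictures_to_audio(pictures, audio_subdata, text_similarity):
    """pictures: list of dicts with an 'image_text' field;
    audio_subdata: list of dicts with an 'audio_text' field;
    text_similarity: any function mapping two strings to [0, 1].
    Returns (picture_index, subdata_index) correspondences."""
    correspondences = []
    for i, picture in enumerate(pictures):
        for j, sub in enumerate(audio_subdata):
            sim = text_similarity(picture["image_text"], sub["audio_text"])
            # One picture may correspond to several pieces of audio sub-data;
            # record every pair whose similarity reaches the threshold.
            if sim >= SIMILARITY_THRESHOLD:
                correspondences.append((i, j))
    return correspondences
```

When a picture matches several pieces of audio sub-data, their playback order can then be taken from the order of the sub-data within the audio file, as noted above.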
In particular, when dividing each audio data, based on the embodiment shown in fig. 1, another data processing method is provided in the embodiment of the present invention, referring to fig. 2, and fig. 2 is a flowchart of a second data processing method provided in the embodiment of the present invention; the method comprises the following steps:
S201, acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
S202, recognizing the audio text information in the audio data by using a speech recognition technology, to obtain text data in the audio data;
S203, performing semantic-relation recognition on the text data, and dividing each piece of audio data according to the semantic-relation recognition result, to obtain a plurality of pieces of audio sub-data;
S204, for any picture, respectively calculating the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data;
S205, if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not less than a preset similarity threshold, recording the correspondence between the picture and the audio sub-data;
S206, making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
In the embodiment of the invention, the audio text information in the audio data is recognized by using a speech recognition technology, to obtain text data in the audio data. Specifically, the audio data may be input into a pre-trained speech recognition model, which recognizes the audio text information in the audio data and converts the audio data into text data. The speech recognition model may be a deep learning network, such as a convolutional neural network or a recurrent neural network, or a non-deep-learning method. With speech recognition, audio data can be converted into text data, i.e., speech is converted into text.
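For concreteness, this step can be sketched as below, assuming a hypothetical ASR interface that yields each recognized word or character together with its start and end times (the per-character time stamps are relied on later); the interface is an assumption, not a specific library's API.

```python
# Sketch of the speech-to-text step. asr_model.recognize() is a hypothetical
# interface assumed to yield (text, start_time, end_time) tuples, one per
# recognized word or character; it is not a specific library's API.

from dataclasses import dataclass

@dataclass
class WordItem:
    text: str          # a recognized word or character
    start_time: float  # seconds from the start of the audio
    end_time: float

def transcribe(audio_path: str, asr_model) -> list[WordItem]:
    return [WordItem(text, start, end)
            for text, start, end in asr_model.recognize(audio_path)]

# The text data of the audio is the concatenation of the items' text fields,
# with each character carrying its own time stamp.
```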
After the text data is obtained, semantic-relation recognition can be performed on it, for example by inputting the text data into a Bi-LSTM (Bidirectional Long Short-Term Memory) model; the audio data is divided by recognizing the contextual relations of the text in the text data, to obtain a plurality of pieces of audio sub-data. For example, if the text data is "the weather is hot today she wears a skirt", the audio sub-data obtained according to the semantic relations may be "the weather is hot today" and "she wears a skirt".
In one implementation, the number of pieces of audio sub-data can be determined from the number of pictures in the picture file; semantic-relation recognition is then performed on the text data, and each piece of audio data is divided according to the recognition result to obtain that number of pieces of audio sub-data.
In one implementation, the number of pictures in the picture file is taken directly as the number of pieces of audio sub-data. For example, if the picture file includes 6 pictures in total, the audio file may be divided into 6 pieces of audio sub-data. In another implementation, the number of pictures is weighted and the weighted result is taken as the number of pieces of audio sub-data: for example, with 6 pictures, adding a weight of 2 gives 6 + 2 = 8 pieces, or multiplying 6 by a weighting coefficient of 2 gives 12 pieces. The size of the weighting coefficient can be set according to the actual situation and is not limited here.
For example, if the number of pictures is N and is taken as the number of pieces of audio sub-data, the audio must be divided into N pieces, which requires N - 1 split points; semantic-relation recognition can be performed on the text data, and the audio data divided according to the recognition result, to obtain the N pieces of audio sub-data.
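Under stated assumptions, this N-piece division can be sketched as follows: the semantic-relation model (for example the Bi-LSTM above) is reduced to a list of per-gap boundary scores, and the N - 1 highest-scoring gaps become the split points. All names are hypothetical.

```python
# Sketch: divide text into N segments (N = number of pictures) by cutting at
# the N-1 gaps with the highest boundary scores. The scores are assumed to
# come from the semantic-relation model; here they are simply an input.

def split_into_n(sentences: list[str], boundary_scores: list[float], n: int):
    """boundary_scores[i] scores a split between sentences[i] and
    sentences[i+1]; len(boundary_scores) == len(sentences) - 1."""
    top_gaps = sorted(range(len(boundary_scores)),
                      key=lambda i: boundary_scores[i],
                      reverse=True)[:n - 1]
    cut_points = sorted(top_gaps)  # restore text order
    segments, prev = [], 0
    for cut in cut_points:
        segments.append("".join(sentences[prev:cut + 1]))
        prev = cut + 1
    segments.append("".join(sentences[prev:]))
    return segments  # exactly n segments
```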
Based on the embodiment shown in fig. 2, another data processing method is provided in the embodiment of the present invention, where the text data in the audio data includes a time stamp of each character; referring to fig. 3, fig. 3 is a flowchart of a third data processing method according to an embodiment of the present invention; the method comprises the following steps:
S301, acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
S302, recognizing the audio text information in the audio data by using a speech recognition technology, to obtain text data in the audio data;
S303, reading each character of the text data in sequence, following the order of the characters in the text data;
S304, calculating the difference between the time stamps of adjacent characters, and if the difference is not less than a preset difference threshold, dividing the adjacent characters into two different pieces of audio sub-data, wherein the character with the earlier time stamp is assigned to the former piece and the character with the later time stamp to the latter piece;
S305, if the difference is less than the preset difference threshold, dividing the adjacent characters into the same piece of audio sub-data;
S306, performing semantic-relation recognition on the text data, and dividing each piece of audio data according to the semantic-relation recognition result, to obtain a plurality of pieces of audio sub-data;
S307, for any picture, respectively calculating the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data;
S308, if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not less than a preset similarity threshold, recording the correspondence between the picture and the audio sub-data;
S309, making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
Generally, when a text containing several paragraphs is read aloud, the reader pauses between paragraphs to separate their content, so there is a relatively large time gap between the corresponding segments of the audio. Therefore, when the audio data carries time stamps, the time stamp of each character can be obtained while the text data in the audio data is recognized.
The interval duration between two adjacent pieces of audio sub-data is calculated from the characters' time stamps:

spaced[i] = sentence[i].start_time - sentence[i-1].end_time

where sentence[i-1].end_time is the time stamp of the last character of the previous piece of audio sub-data, sentence[i].start_time is the time stamp of the first character of the current piece, and spaced[i] is the interval duration between the current piece and the previous piece.
In this way, M interval durations can be obtained. The N - 1 largest of the M interval durations are selected, and the audio is split at the intervals corresponding to those N - 1 durations. Dividing the audio data by interval duration in this way assigns the audio corresponding to one picture to one piece of audio sub-data and preserves the correspondence between picture data and audio sub-data, so that while one picture is browsed, the content of the audio part corresponding to that picture is heard, without interference from the content of the audio part of the previous picture.
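As a sketch (names assumed; sentences carry the start and end time stamps built above):

```python
# Sketch: compute the M interval durations spaced[i] between adjacent pieces
# and split at the N-1 largest intervals (N = number of pictures).
# `sentences` is a list of dicts with start_time/end_time fields.

def pick_split_points(sentences, n_pictures):
    spaced = [sentences[i]["start_time"] - sentences[i - 1]["end_time"]
              for i in range(1, len(sentences))]      # M interval durations
    largest = sorted(range(len(spaced)), key=lambda i: spaced[i],
                     reverse=True)[:n_pictures - 1]   # N-1 largest gaps
    # split point k means: a new piece of audio sub-data starts at
    # sentences[k + 1]
    return sorted(largest)
```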
For example, if the time stamp of the current character is second 1 and the time stamp of the next character is second 3, the difference between their time stamps is 2 seconds, which is greater than a preset difference threshold of 1 second; there is therefore a 2-second pause between the two characters, and they belong to different paragraphs. In one embodiment, when a text containing several paragraphs, each with several sentences, is read aloud, a first interval (for example greater than 0.2 seconds and less than 0.6 seconds) may be left between sentences, and a second interval (for example greater than 0.8 seconds) between paragraphs; the actual intervals between sentences and between paragraphs are set according to the actual situation and are not limited here. In this way, the characters can be divided according to the differences between the time stamps of adjacent characters.
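The gap-threshold division just described can be sketched as follows, reusing the WordItem structure from the earlier sketch; the 1-second threshold is the example's illustrative value, not a prescribed one.

```python
# Sketch: start a new piece of audio sub-data whenever the gap between adjacent
# characters reaches the preset difference threshold.

GAP_THRESHOLD = 1.0  # seconds; the "preset difference threshold" (example value)

def split_on_gaps(items):  # items: list[WordItem], in text order
    pieces, current = [], [items[0]]
    for prev, cur in zip(items, items[1:]):
        if cur.start_time - prev.end_time >= GAP_THRESHOLD:
            pieces.append(current)  # earlier character closes the former piece
            current = []            # later character opens the latter piece
        current.append(cur)
    pieces.append(current)
    return pieces
```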
Based on the embodiment shown in fig. 1, another data processing method is provided in the embodiment of the present invention, referring to fig. 4, and fig. 4 is a flowchart of a fourth data processing method provided in the embodiment of the present invention; the method comprises the following steps:
S401, acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
S402, dividing each piece of audio data according to the audio text information in it, to obtain a plurality of pieces of audio sub-data;
S403, for any picture, identifying the image text information in the picture based on text features in the picture;
S404, sequentially inputting the recognized image text information and each piece of audio sub-data into a pre-trained matching model, to obtain the matching confidence between the image text information and the audio text information in each piece of audio sub-data;
S405, when the matching confidence between the image text information and any piece of audio sub-data is not less than a preset first confidence threshold, recording the correspondence between the picture and the audio sub-data;
S406, making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
In the embodiment of the invention, picture data sets containing text in pictures are collected, such as public challenge-competition data sets, street-view text data sets, and natural-scene text data sets. Based on frameworks such as TensorFlow (a machine learning framework) or Keras (a neural network interface), CTPN (a text detection neural network) and CRNN (a text recognition neural network) are trained; that is, an image recognition model is built on TensorFlow or Keras, a picture is input into the model, the model recognizes the image text information in the picture based on the text features in the picture, and a recognition result is output. The recognition result is compared against a preset result; if the error is larger than a preset threshold, the model parameters are modified, and training continues until the error is not larger than the threshold or the number of iterations reaches a preset number. This yields an image recognition model suitable for recognizing the text content in the picture data of the embodiment of the invention. With this model, any picture can be input and the image text information in it output; for example, the recognition result for picture 1 can be expressed as:
{imgname: 'DavidPic1', texts: [text11, text12, …]}
where imgname denotes the picture name, texts denotes the text data, and text11, text12 denote the specific text contents.
After the image text information of picture 1 is obtained, its similarity with each piece of audio sub-data needs to be calculated. Specifically, the image text information and each piece of audio sub-data can be input into a pre-trained matching model to obtain the matching confidence between the image text information and the audio text information in each piece of audio sub-data. For example, suppose there are 3 pieces of audio sub-data: audio sub-data 1, audio sub-data 2, and audio sub-data 3; the matching confidence of image text information 1 with audio sub-data 1 is 90%, with audio sub-data 2 is 30%, and with audio sub-data 3 is 60%, and the preset first confidence threshold is 80%. Since the matching confidence of image text information 1 with audio sub-data 1 is greater than the preset first confidence threshold, the correspondence between picture 1 and audio sub-data 1 is recorded; that is, when the audio picture is made, picture 1 needs to be associated with audio sub-data 1.
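Expressed as a tiny sketch with the example's numbers (both the confidences and the 80% threshold are the illustrative values above):

```python
# The worked example above as code: confidences of image text information 1
# against the three pieces of audio sub-data; values are illustrative.

confidences = {1: 0.90, 2: 0.30, 3: 0.60}  # sub-data index -> confidence
FIRST_CONFIDENCE_THRESHOLD = 0.80

matches = [idx for idx, conf in confidences.items()
           if conf >= FIRST_CONFIDENCE_THRESHOLD]
assert matches == [1]  # only the picture-1 / audio-sub-data-1 pair is recorded
```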
Based on the embodiment shown in fig. 4, another data processing method is provided in the embodiment of the present invention, referring to fig. 5, and fig. 5 is a flowchart of a fifth data processing method provided in the embodiment of the present invention; the method comprises the following steps:
S501, acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
S502, dividing each piece of audio data according to the audio text information in it, to obtain a plurality of pieces of audio sub-data;
S503, for any picture, identifying the image text information in the picture based on text features in the picture;
S504, if the image text information corresponding to a plurality of pictures is the same, identifying the image text data in any such picture in the picture file based on image features;
S505, sequentially matching the identified image text data with each piece of audio sub-data, to obtain the matching confidence between the image text data and each piece of audio sub-data;
S506, when the matching confidence of the matching result between the image text data and any piece of audio sub-data reaches a preset second confidence threshold, recording the correspondence between the picture and the audio sub-data;
S507, making the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
In the embodiment of the invention, the image text information corresponding to several pictures may be the same: for example, picture 1 and picture 2 both contain the text "Xiaojia". Because the image text information in picture 1 and in picture 2 is identical, when one piece of audio sub-data has the same matching confidence with both, it is impossible to determine which picture the audio sub-data corresponds to. Likewise, when a picture corresponds to two pieces of audio sub-data, or the same audio sub-data corresponds to two pictures, neither the order of the audio sub-data nor the order of the two pictures can be determined. For this situation, the solution of the embodiment of the invention is to identify image text data in the pictures based on image features, and to associate the picture data with the audio sub-data according to the result of matching the image text data against each piece of audio sub-data.
The image text data is obtained by recognizing the picture based on image features and corresponds to the image features in the picture. Specifically, the picture can be input into a pre-trained image-text recognition model, which performs image-feature recognition on the picture to obtain the image text data. For example, the image text data is obtained by recognizing objects, scenes, and the like in the picture; when a chair is recognized, the image text data "chair" is obtained. Assuming only the objects in the picture data are recognized, four identifiers are recognized for picture 1: "chair", "boy", "puppy", and "ball"; the recognized image text data is then matched against each piece of audio sub-data. When the matching confidence of the matching result between the image text data and any piece of audio sub-data reaches a preset second confidence threshold, the correspondence between the picture and the audio sub-data is recorded. The image-text recognition model is a model with an image-text recognition function trained in advance on sample images; it can be a machine learning model, for example a deep learning model. Its training can use conventional back-propagation and is not repeated here.
For example, each piece of audio sub-data is scored on how many items of the image text data it contains, for instance ten percent per hit, and whether each piece of audio sub-data corresponds to picture 1 is finally decided according to the score. Suppose the image text information of 3 pictures is the same, namely for picture 1, picture 2, and picture 3. After scoring, the matching score of picture 1 with audio sub-data 1 is 90%, that of picture 2 with audio sub-data 1 is 60%, and that of picture 3 with audio sub-data 1 is 40%; with a second confidence threshold of 80%, it can be determined that picture 1 corresponds to audio sub-data 1. The accuracy of the correspondence between pictures and audio sub-data is thereby improved.
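A sketch of this disambiguation step follows. The exact scoring rule is not fully specified here, so the sketch assumes the score is the fraction of a picture's recognized labels found in the sub-data text; all names and the threshold value are hypothetical.

```python
# Sketch: when several pictures share identical image text information, score
# each (picture, audio sub-data) pair by the recognized object labels.
# The scoring rule (fraction of labels found in the sub-data text) is an
# assumption; what matters is that the score reaches the second threshold.

SECOND_CONFIDENCE_THRESHOLD = 0.8

def disambiguate(picture_labels, audio_subdata):
    """picture_labels: dict picture_name -> labels, e.g.
    {'picture1': ['chair', 'boy', 'puppy', 'ball'], ...};
    audio_subdata: list of dicts with an 'audio_text' field."""
    records = []
    for name, labels in picture_labels.items():
        for j, sub in enumerate(audio_subdata):
            hits = sum(1 for label in labels if label in sub["audio_text"])
            score = hits / len(labels) if labels else 0.0
            if score >= SECOND_CONFIDENCE_THRESHOLD:
                records.append((name, j, score))
    return records
```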
The following description uses a specific scenario. Children's picture books are a relatively new type of book, consisting mainly of pictures with a small amount of accompanying text. At present, in the process of making a children's picture book, the pictures, the audio, and the time points corresponding to the audio are mainly provided by a partner; to ensure the quality of the picture book, reviewers check whether the stated time points and pictures actually correspond. For the partner, filling in the time intervals and the corresponding pictures is entirely manual, which is cumbersome and makes for a poor experience; the reviewers face many errors and a heavy comparison workload, so efficiency is low.
Therefore, referring to fig. 6, fig. 6 is a flowchart of a sixth data processing method according to an embodiment of the present invention. In the embodiment of the invention, a children's picture book generally comprises an audio file and a plurality of pictures. After the partner uploads the pictures and the audio file in batches, a character recognition model (such as CTPN plus CRNN) is trained with a pre-collected picture data set, the image text information in the picture-book pictures is recognized with this model, and the correspondence between each picture-book picture and its image text information is recorded.
A speech recognition model (e.g., an ASR/CNN model) is then trained on pre-collected audio data, and speech recognition is used to generate from the audio file a file whose format is similar to the content of an srt subtitle file (a text-format subtitle file), called a class-srt file for short. This step only produces an intermediate result of the overall process; it does not need to be suitable for viewing, its purpose being to obtain the time stamp corresponding to each sentence in the audio. Using an existing speech recognition interface, the whole text content corresponding to the audio, i.e., the audio text information, can be obtained as word items (arrays) carrying the start and end time stamps of each word or character in the audio; that is, each array comprises a word or character, a start time, an end time, and punctuation. The class-srt file is generated from the arrays as follows. The arrays are traversed: a sentence starts from the first array; if an array contains an end-of-sentence punctuation mark (a full stop or a question mark), the sentence ends and the next sentence begins; if it does not, the array is assigned to the current sentence and traversal continues with the next array. If the currently traversed array begins a new sentence, a start flag = 1 is recorded together with the start_time of that array, and its word or character content is assigned to the current sentence. When an array contains an end-of-sentence punctuation mark, the corresponding end_time is recorded, and the sentence obtained so far is expressed as content[i] = {content: 'xxxxx', timestamps: [start_time, end_time]}, where timestamps are the time stamps; the content is added to the current sentence, and a new sentence content[i+1] is created. These steps iterate until all arrays have been traversed, yielding a number of sentences, each with a start time and an end time.
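This traversal can be sketched as follows, reusing the WordItem structure from the earlier sketch and assuming each item's text field carries its punctuation, as the arrays described above do:

```python
# Sketch of class-srt generation: accumulate word items into the current
# sentence and close it at an end-of-sentence punctuation mark, recording the
# start time of the first item and the end time of the closing item.

END_PUNCTUATION = ("。", "？", ".", "?")  # full stop and question mark

def build_class_srt(items):  # items: list[WordItem], in audio order
    sentences, current, start = [], [], None
    for item in items:
        if start is None:            # this item begins a new sentence
            start = item.start_time  # record its start_time (start flag)
        current.append(item.text)
        if item.text.endswith(END_PUNCTUATION):
            sentences.append({"content": "".join(current),
                              "timestamps": [start, item.end_time]})
            current, start = [], None
    if current:  # trailing items with no closing punctuation (assumed handling)
        sentences.append({"content": "".join(current),
                          "timestamps": [start, items[-1].end_time]})
    return sentences
```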
Typically, the audio text information of the audio does not match the image text information in the pictures exactly: the image text information is usually far less than the audio text information. The embodiment of the invention uses existing subtitle files to train a BiLSTM-based subtitle-file segmentation model, and applies it to segment the class-srt file, obtaining the pieces of audio sub-data. The subtitle-file segmentation model is a model with a subtitle-file segmentation function trained in advance on samples; it can be a machine learning model, for example a deep learning model. Its training can use conventional back-propagation and is not repeated here.
After the image text information of the pictures and the pieces of audio sub-data of the audio are obtained, it is judged whether the image text information of several picture-book pictures is the same. When it is not the same, the similarity between the image text information of each picture and each piece of audio sub-data is calculated; if the similarity between the image text information of any picture and the audio text information of any piece of audio sub-data is not less than a preset similarity threshold, the correspondence between the picture and the audio sub-data is recorded. The audio file and the picture file are then made into an audio picture by using the correspondence between the pictures and the audio sub-data. Illustratively, a sentence-paragraph matching model (e.g., an LSTM) is trained and used to obtain the similarity between the image text information of a picture and each piece of audio sub-data. The sentence-paragraph matching model is a model with a sentence-paragraph matching function trained in advance on samples; it can be a machine learning model, for example a deep learning model. Its training can use conventional back-propagation and is not repeated here.
When the image text information of several picture-book pictures is the same, the image text data in the pictures is identified according to image features, and the identified image text data is matched sequentially against each piece of audio sub-data to obtain the matching confidence between the image text data and each piece of audio sub-data. When the matching confidence of the matching result between the image text data and any piece of audio sub-data reaches a preset second confidence threshold, the correspondence between the picture and the audio sub-data is recorded, and the audio file and the picture file are made into an audio picture by using the correspondence between the pictures and the audio sub-data.
In the embodiment of the invention, the class-srt file is extracted from the audio file by using speech recognition and is then matched against the image text information and the image text data in the pictures, so that the audio is finally divided according to the content of the pictures. This helps reduce the partner's work when submitting the audio time points and the corresponding pictures, improves the accuracy of the submitted information and the user experience, and at the same time reduces the review workload and improves review efficiency.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a first data processing apparatus according to an embodiment of the present invention. The apparatus comprises an acquisition module 710, a dividing module 720, a calculating module 730, a recording module 740, and a making module 750, wherein:
The acquisition module 710 is configured to acquire an audio file and a picture file for making an audio picture. The audio file comprises at least one audio data, and the picture file comprises a picture; each audio data comprises audio text information, and each picture comprises image text information;
the dividing module 720 is configured to divide each of the audio data according to the audio text information in each of the audio data, so as to obtain a plurality of audio sub-data;
a calculating module 730, configured to calculate, for any of the pictures, similarity between image text information in the picture and audio text information in each of the audio sub-data;
a recording module 740, configured to record a correspondence between any one of the pictures and any one of the audio sub-data if the similarity between the image text information in the picture and the audio text information in the audio sub-data is not less than a preset similarity threshold;
the making module 750 is configured to make the audio file and the picture file into an audio picture by using the correspondence between the pictures and the audio sub-data.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a second data processing apparatus according to an embodiment of the present invention, and in one possible implementation manner, the dividing module 720 includes:
an audio text information recognition sub-module 7201, configured to recognize the audio text information in the audio data by using a speech recognition technology, to obtain text data in the audio data;
the first division submodule 7202 is configured to identify semantic relationships between the text data, and divide each of the audio data according to the semantic relationship identification result, so as to obtain a plurality of audio sub-data.
Based on the embodiment shown in fig. 8, the embodiment of the present invention provides another data processing apparatus, and referring to fig. 9, fig. 9 is a schematic structural diagram of a third data processing apparatus according to the embodiment of the present invention, where in a possible implementation manner, text data in the above-mentioned audio data includes a timestamp of each character; the device further comprises:
a reading sub-module 7203, configured to sequentially read each character in the text data according to the order of the characters in the text data;
a difference calculating sub-module 7204, configured to calculate the difference between the timestamps of adjacent characters and, if the difference is not smaller than a preset difference threshold, divide the adjacent characters into two different audio sub-data, wherein the character with the earlier timestamp is assigned to the preceding audio sub-data and the character with the later timestamp to the following audio sub-data;
and a second dividing sub-module 7205, configured to divide the adjacent characters into the same audio sub-data if the difference is smaller than the preset difference threshold.
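A minimal sketch of the timestamp-based division performed by sub-modules 7203-7205; the TimedChar type and the 0.8-second gap threshold are illustrative assumptions.

```python
# Characters are read in timestamp order; a gap between adjacent
# characters that is not smaller than the threshold closes the current
# audio sub-data and starts the next one.
from dataclasses import dataclass

@dataclass
class TimedChar:
    char: str
    timestamp: float  # seconds from the start of the audio

def divide_by_timestamp(chars, gap_threshold=0.8):
    segments, current = [], []
    for prev, curr in zip(chars, chars[1:]):
        current.append(prev.char)  # earlier-stamped char stays in this sub-data
        if curr.timestamp - prev.timestamp >= gap_threshold:
            segments.append("".join(current))
            current = []  # later-stamped char opens the next sub-data
    if chars:
        current.append(chars[-1].char)
        segments.append("".join(current))
    return segments
```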
Referring to fig. 10, fig. 10 is a schematic structural diagram of a fourth data processing apparatus according to an embodiment of the present invention, in one possible implementation manner, the calculating module 730 includes:
a first image text information recognition sub-module 7301, configured to, for any of the above pictures, recognize image text information in the picture based on text features in the picture;
a first matching sub-module 7302, configured to input the recognized image text information and each audio sub-data in turn into a pre-trained matching model to obtain the matching confidence between the image text information and the audio text information in each audio sub-data;
the recording module 740 is specifically configured to:
and when the matching confidence between the image text information and any one of the audio sub-data is not smaller than a preset first confidence threshold, record the correspondence between the picture and the audio sub-data.
In one possible embodiment, the apparatus further includes:
the second image text information recognition sub-module is configured to recognize, based on image features, the image text data in any picture in the picture file if the image text information corresponding to a plurality of pictures is the same;
The second matching sub-module is used for sequentially matching the recognized image text data with each audio sub-data to obtain the matching confidence of the image text data and each audio sub-data;
the recording module 740 is specifically configured to:
when the matching confidence of the matching result of the image text data and any audio sub-data reaches a preset second confidence threshold, recording the corresponding relation between the picture and the audio sub-data.
The embodiment of the invention further provides an electronic device. As shown in fig. 11, fig. 11 is a schematic structural diagram of the electronic device provided by the embodiment of the present invention, which includes a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, where the processor 1101, the communication interface 1102 and the memory 1103 communicate with each other through the communication bus 1104, and the memory 1103 is used for storing a computer program;
the processor 1101 is configured to execute a program stored in the memory 1103, and implement the following steps:
acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio text information, and each picture comprises image text information;
Dividing each audio data according to the audio text information in each audio data to obtain a plurality of audio sub-data;
for any picture, respectively calculating the similarity between the image text information in the picture and the audio text information in each audio sub-data;
if the similarity between the image text information in any picture and the audio text information in any audio sub-data is not smaller than a preset similarity threshold value, recording the corresponding relation between the picture and the audio sub-data;
and making the audio file and the picture file into an audio picture by using the correspondence between the picture and the audio sub-data.
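Assembled from the sketches above, the processor's steps might look as follows; make_audio_picture and the page/segment helpers remain illustrative assumptions, not the claimed implementation.

```python
# End-to-end sketch: divide the recognized audio text, match pictures to
# audio sub-data, and return the correspondence used to assemble the
# audio picture. Helpers come from the earlier sketches.
def make_audio_picture(transcript, pages, page_images):
    sub_texts = divide_transcript(transcript)
    segments = [AudioSegment(i, t) for i, t in enumerate(sub_texts)]
    if len({p.text for p in pages}) > 1:
        # page texts differ: match on image text information directly
        return match_pages_to_audio(pages, segments)
    # identical page texts: fall back to image-feature-based matching
    return match_by_image_features(pages, page_images, segments)
```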
Optionally, the processor 1101 is configured to implement any of the data processing methods described above when executing the program stored in the memory 1103.
The communication bus of the above electronic device may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include RAM (Random Access Memory) or NVM (Non-Volatile Memory), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), etc.; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In an embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions that, when run on a computer, cause the computer to perform any of the data processing methods of the above embodiments.
In an embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the data processing methods of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, DSL (Digital Subscriber Line)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD (Digital Versatile Disc)), or a semiconductor medium (e.g., an SSD (Solid State Disk)).
It is noted that relational terms such as first and second are used herein only to distinguish one entity or action from another and do not necessarily require or imply any such actual relationship or order between these entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments are described relatively simply since they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (8)
1. A method of data processing, the method comprising:
acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio text information, and each picture comprises image text information;
dividing each audio data according to the audio text information in each audio data to obtain a plurality of audio sub-data;
identifying image text information in any picture based on text features in the picture;
if the image text information corresponding to the plurality of pictures is the same, identifying the image text data in any picture in the picture file based on the image characteristics;
sequentially matching the identified image text data with each audio sub-data to obtain the matching confidence of the image text data and each audio sub-data;
When the matching confidence of the matching result of the image text data and any audio sub-data reaches a preset second confidence threshold, recording the corresponding relation between the picture and the audio sub-data;
and making the audio file and the picture file into an audio picture by utilizing the corresponding relation between the picture and the audio sub-data.
2. The method of claim 1, wherein the dividing each audio data according to the audio text information in each audio data to obtain a plurality of audio sub-data includes:
recognizing the audio text information in the audio data by utilizing a voice recognition technology to obtain text data in the audio data;
and carrying out semantic relation recognition on the text data, and respectively dividing each audio data according to a semantic relation recognition result to obtain a plurality of audio sub-data.
3. The method of claim 2, wherein the text data in the audio data includes a timestamp of each character; after the step of recognizing the audio text information in the audio data by using the voice recognition technology to obtain text data in the audio data, the method further comprises:
Sequentially reading each character in the text data according to the sequence of the characters in the text data;
calculating the difference value of the time stamps of adjacent characters, if the difference value is not smaller than a preset difference value threshold value, dividing the adjacent characters into two different audio sub-data, wherein the character with the early time stamp is divided into the former audio sub-data, and the character with the late time stamp is divided into the latter audio sub-data;
and if the difference value is smaller than the preset difference value threshold value, dividing the adjacent characters into the same audio sub-data.
4. A data processing apparatus, the apparatus comprising:
the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring an audio file and a picture file for making an audio picture, the audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio text information, and each picture comprises image text information;
the dividing module is used for dividing each audio data according to the audio text information in each audio data to obtain a plurality of audio sub-data;
the computing module is used for respectively computing the similarity between the image text information in the picture and the audio text information in each audio sub-data aiming at any picture;
The recording module is used for recording the corresponding relation between the picture and the audio sub-data if the similarity between the image text information in any picture and the audio text information in any audio sub-data is not smaller than a preset similarity threshold value;
the manufacturing module is used for manufacturing the audio file and the picture file into an audio picture by utilizing the corresponding relation between the picture and the audio sub-data;
the computing module includes:
the first image text information identification sub-module is used for identifying the image text information in any picture based on the text features in the picture;
the second image text information identification sub-module is used for identifying image text data in any picture in the picture file based on image characteristics if the image text information corresponding to the pictures is the same;
the second matching sub-module is used for sequentially matching the recognized image text data with each audio sub-data to obtain the matching confidence of the image text data and each audio sub-data;
the recording module is specifically used for:
when the matching confidence of the matching result of the image text data and any audio sub-data reaches a preset second confidence threshold, recording the corresponding relation between the picture and the audio sub-data.
5. The apparatus of claim 4, wherein the partitioning module comprises:
the audio text information recognition sub-module is used for recognizing the audio text information in the audio data by utilizing a voice recognition technology to obtain text data in the audio data;
the first dividing sub-module is used for carrying out semantic relation recognition on the text data, and dividing each audio data according to semantic relation recognition results to obtain a plurality of audio sub-data.
6. The apparatus of claim 5, wherein the text data in the audio data comprises a timestamp of each character; the apparatus further comprises:
the reading sub-module is used for sequentially reading each character in the text data according to the sequence of the characters in the text data;
the difference value calculation sub-module is used for calculating the difference value of the time stamp of the adjacent character, if the difference value is not smaller than a preset difference value threshold value, the adjacent character is divided into two different audio sub-data, wherein the character with the early time stamp is divided into the former audio sub-data, and the character with the late time stamp is divided into the latter audio sub-data;
and the second dividing sub-module is used for dividing the adjacent characters into the same audio sub-data if the difference value is smaller than the preset difference value threshold value.
7. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are in communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor being adapted to carry out the method of any of claims 1-3 when executing the computer program stored on the memory.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
CN202010826912.XA CN111966839B (en) | 2020-08-17 | 2020-08-17 | Data processing method, device, electronic equipment and computer storage medium
Publications (2)
Publication Number | Publication Date |
---|---|
CN111966839A CN111966839A (en) | 2020-11-20 |
CN111966839B (en) | 2023-07-25
Family
ID=73388209
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010826912.XA Active CN111966839B (en) | 2020-08-17 | 2020-08-17 | Data processing method, device, electronic equipment and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111966839B (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |