CN111966839A - Data processing method and device, electronic equipment and computer storage medium - Google Patents

Data processing method and device, electronic equipment and computer storage medium

Info

Publication number
CN111966839A
CN111966839A (application CN202010826912.XA)
Authority
CN
China
Prior art keywords
audio
picture
data
subdata
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010826912.XA
Other languages
Chinese (zh)
Other versions
CN111966839B (en)
Inventor
王睿宇
程启健
尚岩
任翔宇
张笑强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202010826912.XA
Publication of CN111966839A
Application granted
Publication of CN111966839B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/433Query formulation using audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/434Query formulation using image data, e.g. images, photos, pictures taken by a user
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the invention provide a data processing method and apparatus, an electronic device, and a computer storage medium. An audio file and a picture file used for making an audio picture are obtained, and each piece of audio data is divided according to the audio text information it contains, yielding multiple pieces of audio sub-data. For any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data is calculated. If the similarity between the image text information in a picture and the audio text information in a piece of audio sub-data is not smaller than a preset similarity threshold, the correspondence between that picture and that audio sub-data is recorded. The audio file and the picture file are then made into an audio picture using the recorded correspondences, so that the content of the audio part is automatically matched with the content of the non-audio part and the efficiency of making audio pictures is improved.

Description

Data processing method and device, electronic equipment and computer storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a computer storage medium.
Background
Reading helps people understand the world, acquire knowledge, cultivate interests, and improve thinking ability. Traditionally, reading means obtaining information by visual browsing. To make reading more engaging, new reading modes have emerged in which information is acquired by combining hearing and vision, such as audio books and picture books for children or adults. When hearing is the primary channel and vision is auxiliary, a reader's imagination can be better stimulated.
When information is acquired by combining hearing and vision, it must include an audio part and a non-audio part. The non-audio part may be text, images, or pictures, and the audio part explains the content of the non-audio part. To make the content of a book easier to understand, the content of the audio part and the content of the non-audio part need to correspond one to one, so that when a reader browses any part of the non-audio content, the corresponding audio part is played automatically. In the prior art, the audio content and the non-audio content of audio books and picture books must be matched manually; the whole process is cumbersome, error-prone, and inefficient.
Disclosure of Invention
Embodiments of the present invention provide a data processing method, an apparatus, an electronic device, and a computer storage medium, so as to automatically match a content of an audio portion with a content of a non-audio portion. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
acquiring an audio file and a picture file for making an audio picture, where the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
dividing each piece of audio data according to the audio text information in it, to obtain a plurality of pieces of audio sub-data;
for any picture, respectively calculating the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data;
if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not smaller than a preset similarity threshold, recording the correspondence between the picture and the audio sub-data; and
making the audio file and the picture file into an audio picture by using the correspondence between pictures and audio sub-data.
Optionally, dividing each piece of audio data according to the audio text information in it to obtain a plurality of pieces of audio sub-data includes:
recognizing the audio text information in the audio data by using speech recognition to obtain text data of the audio data; and
performing semantic relationship recognition on the text data, and dividing each piece of audio data according to the semantic relationship recognition result to obtain a plurality of pieces of audio sub-data.
Optionally, the text data of the audio data includes a time stamp for each character; after recognizing the audio text information in the audio data to obtain the text data, the method further includes:
reading the characters of the text data in order;
calculating the difference between the time stamps of adjacent characters; if the difference is not smaller than a preset difference threshold, dividing the adjacent characters into two different pieces of audio sub-data, the character with the earlier time stamp belonging to the former piece and the character with the later time stamp belonging to the latter piece; and
if the difference is smaller than the preset difference threshold, dividing the adjacent characters into the same piece of audio sub-data.
Optionally, calculating, for any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data includes:
for any picture, recognizing the image text information in the picture based on text features in the picture; and
inputting the recognized image text information and each piece of audio sub-data in turn into a pre-trained matching model to obtain the matching confidence between the image text information and the audio text information in each piece of audio sub-data.
Recording the correspondence between the picture and the audio sub-data when the similarity is not smaller than the preset similarity threshold then includes:
recording the correspondence between the picture and a piece of audio sub-data when the matching confidence between the image text information and that piece of audio sub-data is not smaller than a preset first confidence threshold.
Optionally, after recognizing the image text information in any picture, the method further includes:
if the image text information of several pictures is identical, recognizing, for any picture in the picture file, image text data in the picture based on image features; and
matching the recognized image text data with each piece of audio sub-data in turn to obtain the matching confidence between the image text data and each piece of audio sub-data.
Recording the correspondence then includes:
recording the correspondence between the picture and a piece of audio sub-data when the matching confidence between the image text data and that piece of audio sub-data reaches a preset second confidence threshold.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes:
an acquisition module, configured to acquire an audio file and a picture file for making an audio picture, where the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
a dividing module, configured to divide each piece of audio data according to the audio text information in it, to obtain a plurality of pieces of audio sub-data;
a calculation module, configured to calculate, for any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data;
a recording module, configured to record the correspondence between a picture and a piece of audio sub-data if the similarity between the image text information in the picture and the audio text information in the audio sub-data is not smaller than a preset similarity threshold; and
a making module, configured to make the audio file and the picture file into an audio picture by using the correspondence between pictures and audio sub-data.
Optionally, the dividing module includes:
an audio text information recognition submodule, configured to recognize the audio text information in the audio data by using speech recognition to obtain text data of the audio data; and
a first dividing submodule, configured to perform semantic relationship recognition on the text data and divide each piece of audio data according to the recognition result to obtain a plurality of pieces of audio sub-data.
Optionally, the text data of the audio data includes a time stamp for each character, and the apparatus further includes:
a reading submodule, configured to read the characters of the text data in order;
a difference calculation submodule, configured to calculate the difference between the time stamps of adjacent characters and, if the difference is not smaller than a preset difference threshold, divide the adjacent characters into two different pieces of audio sub-data, the character with the earlier time stamp belonging to the former piece and the character with the later time stamp belonging to the latter piece; and
a second dividing submodule, configured to divide the adjacent characters into the same piece of audio sub-data if the difference is smaller than the preset difference threshold.
Optionally, the calculation module includes:
a first image text information recognition submodule, configured to recognize, for any picture, the image text information in the picture based on text features in the picture; and
a first matching submodule, configured to input the recognized image text information and each piece of audio sub-data in turn into a pre-trained matching model to obtain the matching confidence between the image text information and the audio text information in each piece of audio sub-data.
The recording module is specifically configured to:
record the correspondence between the picture and a piece of audio sub-data when the matching confidence between the image text information and that piece of audio sub-data is not smaller than a preset first confidence threshold.
Optionally, the apparatus further includes:
a second image text information recognition submodule, configured to recognize, for any picture in the picture file, image text data in the picture based on image features if the image text information of several pictures is identical; and
a second matching submodule, configured to match the recognized image text data with each piece of audio sub-data in turn to obtain the matching confidence between the image text data and each piece of audio sub-data.
The recording module is specifically configured to:
record the correspondence between the picture and a piece of audio sub-data when the matching confidence between the image text data and that piece of audio sub-data reaches a preset second confidence threshold.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program; and
the processor is configured to implement the method of any implementation of the first aspect when executing the computer program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of any implementation of the first aspect.
With the data processing method and apparatus, electronic device, and computer storage medium provided by the embodiments of the invention, an audio file and a picture file for making an audio picture are acquired, where the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information. Each piece of audio data is divided according to its audio text information to obtain a plurality of pieces of audio sub-data. For any picture, the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data is calculated. If the similarity between the image text information in a picture and the audio text information in a piece of audio sub-data is not smaller than a preset similarity threshold, the correspondence between the picture and the audio sub-data is recorded, and the audio file and the picture file are made into an audio picture using these correspondences.
In the embodiments of the invention, audio data can be divided automatically into a plurality of pieces of audio sub-data, the similarity between the image text information of each picture and the audio text information of each piece of audio sub-data is calculated, and pictures are associated with audio sub-data according to the similarity. Applying the embodiments of the invention thus automatically associates each picture with the audio sub-data of the audio data, that is, automatically matches the content of the audio part with the content of the non-audio part. Making the audio file and the picture file into an audio picture using these correspondences improves the efficiency of making audio pictures. Of course, no product or method practicing the invention necessarily achieves all of the above advantages at the same time.
Drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a first data processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a second data processing method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a third data processing method according to an embodiment of the present invention;
Fig. 4 is a flowchart of a fourth data processing method according to an embodiment of the present invention;
Fig. 5 is a flowchart of a fifth data processing method according to an embodiment of the present invention;
Fig. 6 is a flowchart of a sixth data processing method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a first data processing apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a second data processing apparatus according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a third data processing apparatus according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a fourth data processing apparatus according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
To solve the prior-art problem that the content of the audio part and the content of the non-audio part of an audio picture must be matched manually, embodiments of the present invention provide a data processing method and apparatus, an electronic device, a computer storage medium, and a computer program product containing instructions.
A data processing method provided by an embodiment of the present invention is described first. The method is applied to an electronic device; specifically, the electronic device may be any device that can provide data processing services, such as a personal computer or a server. The data processing method provided by the embodiment of the invention may be implemented by at least one of software, a hardware circuit, and a logic circuit in the electronic device.
An embodiment of the present invention provides a data processing method, and referring to fig. 1, fig. 1 is a flowchart of a first data processing method provided in an embodiment of the present invention; the method comprises the following steps:
s101, acquiring an audio file and a picture file for making an audio picture. The audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio character information, and each picture comprises image character information.
And S102, dividing the audio data respectively according to the audio character information in the audio data to obtain a plurality of audio subdata.
S103, for any one of the pictures, respectively calculating the similarity between the image character information in the picture and the audio character information in each piece of the audio sub-data.
And S104, if the similarity between the image text information in any one of the pictures and the audio text information in any one of the audio subdata is not smaller than a preset similarity threshold, recording the corresponding relation between the picture and the audio subdata.
And S105, making the audio file and the picture file into an audio picture by using the corresponding relation between the picture and the audio subdata.
In the embodiment of the invention, the audio data can be automatically divided into a plurality of audio subdata, the similarity between the image text information in the picture and the audio text information in each audio subdata is respectively calculated for any picture, and the picture and the audio subdata are corresponding according to the similarity, so that the automatic establishment of association between each picture and the audio subdata of the audio data is realized, and the automatic matching of the content of an audio part and the content of a non-audio part is realized. And the audio file and the picture file are made into an audio picture by utilizing the corresponding relation between the picture and the audio subdata, so that the efficiency of making the audio picture is improved.
An audio picture may be an audio book or a picture book for children or adults. It includes the content of an audio part and the content of a non-audio part; the audio part further explains the non-audio part, and the non-audio part shows the scene expressed by the audio part. For example, a children's picture book includes several pictures, and while one picture is being browsed the audio content corresponding to that picture is played, giving a better browsing and reading experience. Making an audio picture requires an audio file and a picture file: the audio file is used to make the content of the audio part, and the picture file is used to make the content of the non-audio part. The audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information. The audio text information is the text expressed by the audio data: for example, if the audio says "the weather is good today", its audio text information is "the weather is good today". The image text information is the text contained in a picture: for example, if a picture contains the word "weather", its image text information is "weather".
After the audio text information of the audio data is obtained, the audio data can be divided in a preset manner to obtain a plurality of pieces of audio sub-data, where a piece of audio sub-data is part of the audio data. For example, if the audio data corresponds to a text of three paragraphs, each containing several sentences, a piece of audio sub-data may be one paragraph or one sentence; if the audio data corresponds to one paragraph of three sentences, each sentence may be one piece of audio sub-data.
For example, the audio data may be divided according to semantic relationships in the audio text information, or the order of the characters in the audio text information may be recognized and the audio data divided accordingly. Suppose the audio data is a recording of the primary-school text "Shadow", whose text reads "The shadow is in front, the shadow is behind; the shadow always follows me, just like a little black dog". Dividing it by semantic relationship yields four pieces of audio sub-data: "The shadow is in front", "the shadow is behind", "the shadow always follows me", and "just like a little black dog".
To associate audio sub-data with pictures, the similarity between the image text information of each picture and the audio text information of each piece of audio sub-data is calculated. Note that one picture may correspond to one piece of audio sub-data or to several pieces. For any picture, whether it corresponds to each piece of audio sub-data can be determined from the calculated similarity, and the audio file and the picture file can then be made into an audio picture using these correspondences. When a picture corresponds to several pieces of audio sub-data, the order of those pieces can be determined from their order in the audio file.
Specifically, regarding how each piece of audio data is divided, based on the embodiment shown in fig. 1, an embodiment of the present invention provides another data processing method. Referring to fig. 2, fig. 2 is a flowchart of a second data processing method provided by an embodiment of the present invention; the method includes:
S201, acquiring an audio file and a picture file for making an audio picture, where the audio file comprises at least one piece of audio data and the picture file comprises at least one picture; each piece of audio data comprises audio text information, and each picture comprises image text information;
S202, recognizing the audio text information in the audio data by using speech recognition to obtain text data of the audio data;
S203, performing semantic relationship recognition on the text data, and dividing each piece of audio data according to the recognition result to obtain a plurality of pieces of audio sub-data;
S204, for any picture, respectively calculating the similarity between the image text information in the picture and the audio text information in each piece of audio sub-data;
S205, if the similarity between the image text information in any picture and the audio text information in any piece of audio sub-data is not smaller than a preset similarity threshold, recording the correspondence between the picture and the audio sub-data;
S206, making the audio file and the picture file into an audio picture by using the correspondence between pictures and audio sub-data.
In the embodiment of the invention, speech recognition is used to recognize the audio text information in the audio data and obtain its text data. Specifically, the audio data may be input into a pre-trained speech recognition model that recognizes the audio text information and converts the audio data into text data. The speech recognition model may be a deep learning network, such as a convolutional neural network or a recurrent neural network, or a non-deep-learning method. With speech recognition, audio data is converted into text data, that is, speech is converted into text.
After the text data is obtained, semantic relationship recognition may be performed on it, for example by inputting the text data into a Bi-directional Long Short-Term Memory (BiLSTM) model. By recognizing the contextual relationships of the text, the audio data is divided into a plurality of pieces of audio sub-data. For example, if the text data is "the weather is hot today and she wears a skirt", the semantic relationship yields the audio sub-data "the weather is hot today" and "she wears a skirt".
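The patent names BiLSTM but gives no architecture, so the following Python (PyTorch) snippet is purely an illustrative sketch of how such a boundary-predicting segmenter could be wired; the class name, dimensions, and threshold are assumptions, and the model is untrained, so the printed split is meaningless until the model is trained on labeled text.

import torch
import torch.nn as nn

class BoundarySegmenter(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.boundary = nn.Linear(2 * hidden, 1)  # one boundary score per character

    def forward(self, char_ids):
        x = self.embed(char_ids)             # (batch, seq, embed_dim)
        h, _ = self.lstm(x)                  # (batch, seq, 2*hidden)
        return self.boundary(h).squeeze(-1)  # boundary logits, one per character

def split_by_boundaries(text, logits, thr=0.0):
    """Cut the text after every character whose boundary logit exceeds thr."""
    pieces, start = [], 0
    for i, score in enumerate(logits.tolist()):
        if score > thr:
            pieces.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        pieces.append(text[start:])
    return pieces

# Untrained usage example; real boundaries require training on labeled text.
model = BoundarySegmenter(vocab_size=5000)
text = "the weather is hot today she wears a skirt"
ids = torch.randint(0, 5000, (1, len(text)))  # stand-in character ids
print(split_by_boundaries(text, model(ids)[0]))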
In one implementation, the number of pieces of audio sub-data may be determined from the number of pictures in the picture file; semantic relationship recognition is then performed on the text data, and each piece of audio data is divided according to the recognition result into that many pieces of audio sub-data.
In one implementation, the number of pictures in the picture file is used directly as the number of pieces of audio sub-data: if the picture file contains 6 pictures, the audio file is divided into 6 pieces of audio sub-data. In another implementation, the number of pictures is weighted and the result is used as the number of pieces: for 6 pictures, adding an offset of 2 gives 8 pieces, or multiplying by a weighting coefficient of 2 gives 12 pieces. The weighting coefficient can be set according to the actual situation and is not limited here.
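A minimal sketch of this counting rule; the function name is ours, and the offset and coefficient are just the example values above:

def target_piece_count(num_pictures, offset=0, factor=1):
    """Number of pieces of audio sub-data derived from the picture count.

    offset and factor implement the additive and multiplicative weightings
    described above; offset=0, factor=1 uses the picture count directly.
    """
    return num_pictures * factor + offset

print(target_piece_count(6))            # 6 pieces
print(target_piece_count(6, offset=2))  # 8 pieces
print(target_piece_count(6, factor=2))  # 12 pieces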
Illustratively, if the number of pictures is N and the picture count is used as the number of pieces of audio sub-data, then dividing the audio into N pieces of audio sub-data requires N-1 split points; semantic relationship recognition can be performed on the text data and the audio data divided according to the recognition result to obtain N pieces of audio sub-data.
Based on the embodiment shown in fig. 2, an embodiment of the present invention provides another data processing method in which the text data of the audio data includes a time stamp for each character. Referring to fig. 3, fig. 3 is a flowchart of a third data processing method provided by an embodiment of the present invention; the method includes:
s301, an audio file and a picture file for making an audio picture are obtained. The audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio character information, and each picture comprises image character information;
s302, recognizing audio character information in the audio data by utilizing a voice recognition technology to obtain character data in the audio data;
s303, reading each character in the character data in sequence according to the sequence of the characters in the character data;
s304, calculating a difference value of time stamps of adjacent characters, and if the difference value is not smaller than a preset difference value threshold value, dividing the adjacent characters into two different audio subdata, wherein the character with the early time stamp is divided into a former audio subdata, and the character with the late time stamp is divided into a latter audio subdata;
s305, if the difference value is smaller than the preset difference value threshold value, dividing the adjacent characters into the same audio subdata;
s306, performing semantic relation recognition on the character data, and dividing each audio data according to a semantic relation recognition result to obtain a plurality of audio subdata;
s307, for any one of the pictures, respectively calculating the similarity between the image character information in the picture and the audio character information in the audio subdata;
s308, if the similarity between the image text information in any one of the pictures and the audio text information in any one of the audio subdata is not smaller than a preset similarity threshold, recording the corresponding relation between the picture and the audio subdata;
s309, making the audio file and the picture file into an audio image by using the corresponding relation between the picture and the audio subdata.
Generally, when a text containing several paragraphs is read aloud, pauses are needed between paragraphs to distinguish their contents, so there are relatively large time differences between paragraphs in the audio. Therefore, when the text data recognized from the audio data carries time stamps, the time stamp of each character can be obtained.
The interval duration between two adjacent pieces of audio sub-data is calculated from the character time stamps, which can be expressed as:
spaced[i] = sentence[i].start_time - sentence[i-1].end_time
where sentence[i-1].end_time is the time stamp of the last character of the previous piece of audio sub-data, sentence[i].start_time is the time stamp of the first character of the current piece, and spaced[i] is the interval duration between the current piece and the previous one.
M interval durations can thus be obtained. The N-1 largest of the M interval durations are selected, and the audio is split at the corresponding gaps. Dividing by interval duration lets the audio data corresponding to one picture become one piece of audio sub-data, which guarantees the correspondence between picture data and audio sub-data: while a picture is browsed, the audio content corresponding to that picture is heard, without interference from the audio content of the previous picture.
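A sketch of this gap-based division, assuming each sentence carries the start and end time stamps defined by the formula above; the Sentence structure and the sample times are illustrative, not from the patent.

from dataclasses import dataclass

@dataclass
class Sentence:
    content: str
    start_time: float  # time stamp of the first character, in seconds
    end_time: float    # time stamp of the last character, in seconds

def split_at_largest_gaps(sentences, n_pictures):
    """Group the sentences into n_pictures pieces of audio sub-data."""
    # spaced[i] = sentence[i].start_time - sentence[i-1].end_time
    gaps = [(sentences[i].start_time - sentences[i - 1].end_time, i)
            for i in range(1, len(sentences))]
    # cut at the N-1 largest interval durations
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:n_pictures - 1])
    pieces, start = [], 0
    for cut in cuts + [len(sentences)]:
        pieces.append(sentences[start:cut])
        start = cut
    return pieces

sents = [Sentence("The shadow is in front", 0.0, 1.2),
         Sentence("the shadow is behind", 2.5, 3.6),
         Sentence("the shadow always follows me", 3.9, 5.4),
         Sentence("just like a little black dog", 7.0, 8.3)]
for piece in split_at_largest_gaps(sents, n_pictures=3):
    print([s.content for s in piece])
# ['The shadow is in front']
# ['the shadow is behind', 'the shadow always follows me']
# ['just like a little black dog']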
For example, if the time stamp of the current character is second 1 and the time stamp of the next character is second 3, the difference between them is 2 seconds, which exceeds a preset difference threshold of 1 second; there is thus a 2-second pause between the two characters, and they are judged to belong to different paragraphs. In one implementation, when a text containing several paragraphs, each with several sentences, is read aloud, the pause before the next sentence may be a first interval longer than 0.2 second and shorter than 0.6 second, and the pause before the next paragraph may be a second interval longer than 0.8 second; the intervals between sentences and between paragraphs are set according to the actual situation and are not limited here. The characters can thus be divided according to the difference between the time stamps of adjacent characters.
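A minimal sketch of the threshold rule of steps S303 to S305, assuming each character carries one time stamp; the record layout and the 1-second threshold follow the example above.

def split_by_pause(chars, diff_threshold=1.0):
    """chars: [{'char': 'h', 'time': 0.9}, ...] in reading order."""
    pieces, current = [], [chars[0]["char"]]
    for prev, cur in zip(chars, chars[1:]):
        if cur["time"] - prev["time"] >= diff_threshold:
            pieces.append("".join(current))  # earlier-stamped char ends this piece
            current = [cur["char"]]          # later-stamped char starts the next
        else:
            current.append(cur["char"])
    pieces.append("".join(current))
    return pieces

chars = [{"char": "h", "time": 0.9}, {"char": "i", "time": 1.0},
         {"char": "b", "time": 3.0}, {"char": "y", "time": 3.1},
         {"char": "e", "time": 3.2}]
print(split_by_pause(chars))  # ['hi', 'bye']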
Based on the embodiment shown in fig. 1, another data processing method is provided in the embodiment of the present invention, referring to fig. 4, fig. 4 is a flowchart of a fourth data processing method provided in the embodiment of the present invention; the method comprises the following steps:
s401, acquiring an audio file and a picture file for making an audio picture. The audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio character information, and each picture comprises image character information;
s402, dividing each audio data according to the audio character information in each audio data to obtain a plurality of audio subdata;
s403, for any one of the pictures, identifying image character information in the picture based on character features in the picture;
s404, inputting the recognized image character information and each audio subdata into a pre-trained matching model in sequence to obtain the matching confidence coefficient of the image character information and the audio character information in each audio subdata;
s405, when the matching confidence coefficient of the digital image data and any one of the audio subdata is not smaller than a preset first confidence coefficient threshold value, recording the corresponding relation between the image and the audio subdata;
s406, the audio file and the picture file are made into an audio image by utilizing the corresponding relation between the picture and the audio subdata.
In the embodiment of the invention, picture data sets whose pictures contain text are collected, such as a public challenge data set, a street-view text data set, and a natural-scene text data set. Based on a framework such as TensorFlow (a machine learning system) or Keras (a neural network interface), a CTPN (text detection) network and a CRNN (text recognition) network are trained; that is, an image recognition model is built on top of TensorFlow or Keras. A picture is input into the image recognition model, the model recognizes the image text information in the picture based on its text features and outputs the recognition result, and the result is compared with the expected result. If the error is greater than a preset threshold, the model parameters are adjusted, until the error is not greater than the threshold or the number of iterations reaches a preset count. After training, an image recognition model that performs text recognition on picture content is obtained: inputting any picture yields the image text information in it. The model can further record the correspondence between a picture and its image text information. For example, if picture 1, named "DavidPic1", is input into the image recognition model, the model's output can be:
{imgname: 'DavidPic1', texts: [text11, text12, …]}
where imgname is the picture name, texts is the text data, and text11 and text12 are the specific contents of the text data.
After the image text information of picture 1 is obtained, its similarity to each piece of audio sub-data needs to be calculated. Specifically, the image text information and each piece of audio sub-data can be input into a pre-trained matching model, yielding the matching confidence between the image text information and the audio text information in each piece of audio sub-data. For example, suppose there are three pieces of audio sub-data 1, 2, and 3, the matching confidences of image text information 1 with audio sub-data 1, 2, and 3 are 90%, 30%, and 60% respectively, and the preset first confidence threshold is 80%. Because the matching confidence between image text information 1 and audio sub-data 1 exceeds the threshold, the correspondence between picture 1 and audio sub-data 1 is recorded; that is, when the audio picture is made, picture 1 is associated with audio sub-data 1.
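A sketch of steps S403 to S405. The patent's matching model is a trained network it does not specify, so difflib's string ratio stands in for the model's confidence here; the threshold and data are the illustrative values from the example above.

from difflib import SequenceMatcher

def match_confidence(image_text, audio_text):
    # stand-in for the trained matching model's confidence output
    return SequenceMatcher(None, image_text, audio_text).ratio()

def record_correspondences(pictures, audio_pieces, first_threshold=0.8):
    """pictures: picture name -> recognized image text information."""
    correspondences = []
    for name, image_text in pictures.items():
        for idx, audio_text in enumerate(audio_pieces):
            if match_confidence(image_text, audio_text) >= first_threshold:
                correspondences.append((name, idx))
    return correspondences

pieces = ["The shadow is in front", "the shadow is behind",
          "just like a little black dog"]
print(record_correspondences({"DavidPic1": "the shadow is behind"}, pieces))
# expected: [('DavidPic1', 1)]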
Based on the embodiment shown in fig. 4, another data processing method is provided in the embodiment of the present invention, referring to fig. 5, and fig. 5 is a flowchart of a fifth data processing method provided in the embodiment of the present invention; the method comprises the following steps:
s501, acquiring an audio file and a picture file for making an audio picture. The audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio character information, and each picture comprises image character information;
s502, dividing each audio data according to the audio character information in each audio data to obtain a plurality of audio subdata;
s503, aiming at any one picture, identifying image character information in the picture based on character features in the picture;
s504, if the image character information corresponding to the pictures is the same, identifying the image character data in the picture based on the image characteristics aiming at any picture in the picture file;
s505, matching the identified image text data with each audio subdata in sequence to obtain matching confidence of the image text data and each audio subdata;
s506, when the matching confidence coefficient of the image character data and any audio subdata reaches a preset second confidence coefficient threshold value, recording the corresponding relation between the image and the audio subdata;
and S507, making the audio file and the picture file into an audio image by using the corresponding relation between the picture and the audio subdata.
In the embodiment of the invention, several pictures may carry the same image text information; for example, picture 1 and picture 2 both contain the text "Xiao Ming's home". Because the image text information in picture 1 equals that in picture 2, when one piece of audio sub-data has the same matching confidence with both, it cannot be determined which picture the audio sub-data corresponds to. Likewise, when one picture corresponds to two pieces of audio sub-data, or one piece of audio sub-data corresponds to two pictures, the order of the audio sub-data or of the pictures cannot be judged. For this situation, the solution of the embodiment of the invention is to recognize image text data in the pictures based on image features and to associate pictures with audio sub-data according to the matching results between the image text data and each piece of audio sub-data.
The image text data is obtained by recognizing the picture based on image features and corresponds to the image features in the picture. Specifically, the picture can be input into a pre-trained image text recognition model, which recognizes image features in the picture and outputs image text data. Illustratively, the image text data is obtained by recognizing objects, scenes, and so on in the picture: recognizing a chair yields the image text data "chair". Suppose only objects in the picture are recognized; for picture 1, the recognized image text data is "chair", "boy", "puppy", and "ball", and this image text data is matched against the audio sub-data. When the matching confidence between the image text data and any piece of audio sub-data reaches a preset second confidence threshold, the correspondence between the picture and the audio sub-data is recorded. The image text recognition model is obtained by pre-training on sample images and may be a machine learning model, for example a deep learning model. The training process can use conventional back propagation and is not described again here.
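A sketch of this image-feature fallback. The patent does not define how label hits become a confidence, so treating the fraction of detected labels found in the audio text as the confidence is an assumption for illustration, as is the threshold value:

def label_confidence(labels, audio_text):
    """Fraction of detected labels that appear in the audio text."""
    text = audio_text.lower()
    hits = sum(1 for label in labels if label.lower() in text)
    return hits / len(labels) if labels else 0.0

labels = ["chair", "boy", "puppy", "ball"]  # image text data for picture 1
pieces = ["A boy throws a ball and the puppy chases it",
          "The shadow is in front"]
second_threshold = 0.5  # illustrative value
for idx, piece in enumerate(pieces):
    conf = label_confidence(labels, piece)
    if conf >= second_threshold:
        print(f"picture 1 corresponds to audio sub-data {idx} (conf={conf:.2f})")
# picture 1 corresponds to audio sub-data 0 (conf=0.75)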
Illustratively, each piece of audio sub-data can be scored on whether it contains the image text data, for example one point per hit up to ten, and whether each piece corresponds to picture 1 is finally decided by the score. For example, if the image text information of pictures 1, 2, and 3 is identical, the matching scores of picture 1 with audio sub-data 1, 2, and 3 are 90%, 60%, and 40% respectively, and the second confidence threshold is 80%, it can be determined that picture 1 corresponds to audio sub-data 1. The accuracy of the correspondence between pictures and audio sub-data is thereby improved.
The following description combines a concrete scenario. A children's picture book is a kind of book that consists mainly of drawings with a small amount of text. At present, when a children's picture book is produced, a partner provides the pictures of the book, the audio, and the time points at which pictures and audio correspond; to guarantee quality, an auditor checks whether each time point really corresponds to its picture. For the partner, filling in the time intervals and corresponding pictures manually is cumbersome and gives a poor experience; for the auditors, there are many errors to face, the comparison workload is large, and the efficiency is low.
An embodiment of the present invention therefore provides a data processing method; referring to fig. 6, fig. 6 is a flowchart of a sixth data processing method provided by an embodiment of the present invention. A children's picture book usually comprises one audio file and several pictures. After the partner uploads the pictures and the audio file in a batch, a text recognition model (such as CTPN plus CRNN) is first trained with a pre-collected picture data set, the image text information in each picture-book picture is recognized with this model, and the correspondence between each picture and its image text information is recorded.
A speech recognition model (e.g., an ASR/CNN model) is then trained with pre-collected audio data, and speech recognition is used to generate from the audio file a file whose content format resembles an srt subtitle file (a text-format subtitle file), called an srt-like file here. This step is only an intermediate result of the whole process; it does not need to be suitable for viewing, its purpose is to obtain the time stamps of the sentences in the audio. Using an existing speech recognition interface, the whole text content corresponding to the audio (the audio text information) can be obtained together with word items: an array of the start and end time stamps of each word in the audio, where each entry comprises a word, a start time, an end time, and punctuation. The srt-like file is generated from this array as follows: traverse the entries, starting a sentence at the first entry. If the entry being traversed begins a new sentence, set a start flag to 1, record the entry's start_time, and add the word to the current sentence. If an entry contains a sentence-final punctuation mark (a full stop or question mark), record the entry's end_time, close the current sentence, represented as sentence[i] = {content: 'xxxxx', timetags: [start_time, end_time]} with timetags as the time stamps, and open a new sentence sentence[i+1]; otherwise, add the entry to the current sentence and continue with the next entry. Iterate until all entries have been traversed, yielding a sequence of sentences, each with a start time and an end time.
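A sketch of this sentence-building pass; the WordItem field names mirror the description (word, start time, end time, punctuation) but are otherwise assumptions, as is the set of sentence-final marks.

from dataclasses import dataclass

@dataclass
class WordItem:
    word: str
    start_time: float
    end_time: float
    punctuation: str  # punctuation following the word, '' if none

SENTENCE_END = {".", "?", "。", "？"}

def build_sentences(items):
    sentences, current, start = [], [], None
    for item in items:
        if start is None:               # this item begins a new sentence
            start = item.start_time
        current.append(item.word + item.punctuation)
        if item.punctuation in SENTENCE_END:
            sentences.append({"content": " ".join(current),
                              "timetags": [start, item.end_time]})
            current, start = [], None
    if current:                         # trailing words without a final mark
        sentences.append({"content": " ".join(current),
                          "timetags": [start, items[-1].end_time]})
    return sentences

items = [WordItem("the", 0.0, 0.2, ""), WordItem("shadow", 0.2, 0.6, ""),
         WordItem("is", 0.6, 0.7, ""), WordItem("behind", 0.7, 1.1, "."),
         WordItem("it", 2.4, 2.5, ""), WordItem("follows", 2.5, 3.0, ""),
         WordItem("me", 3.0, 3.2, ".")]
for s in build_sentences(items):
    print(s)
# {'content': 'the shadow is behind.', 'timetags': [0.0, 1.1]}
# {'content': 'it follows me.', 'timetags': [2.4, 3.2]}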
Usually, the audio text information of the audio is not completely matched with the image text information in the picture, and the image text information in the picture is much less or much more than the audio text information of the audio. The embodiment of the invention utilizes the existing subtitle file and trains a subtitle file segmentation model based on the BilSTM to perform segmentation processing on the class srt file; and obtaining each audio subdata. The subtitle file segmentation model is a model with a subtitle file segmentation function obtained by pre-training based on a sample image, and the subtitle file segmentation model may be a model based on machine learning, for example, a model based on deep learning. The specific training process may implement model training in a traditional back propagation manner, which is not described herein again.
After obtaining the image character information in the picture of the picture book and each audio subdata of the audio frequency, judging whether the image character information of a plurality of picture book pictures is the same, when the image character information of a plurality of picture book pictures is not the same, respectively calculating the similarity between the image character information of the picture and each audio subdata aiming at the image character information of any picture book picture, and recording the corresponding relation between the picture and the audio subdata if the similarity between the image character information of any picture and the audio character information of any audio subdata is not less than a preset similarity threshold value. And then, the audio file and the picture file are made into an audio picture by utilizing the corresponding relation between the picture and the audio subdata. Illustratively, a sentence and paragraph matching model (such as LSTM) is trained, and the similarity between the image text information of the picture and each piece of audio subdata is obtained by using the sentence and paragraph matching model. The sentence paragraph matching model is a model with a sentence paragraph matching function trained in advance based on the sample image, and the sentence paragraph matching model may be a model based on machine learning, for example, a model based on deep learning. The specific training process may implement model training in a traditional back propagation manner, which is not described herein again.
When the image character information of a plurality of picture books is the same, identifying the image character data in any picture in the picture file based on the image characteristics, sequentially matching the identified image character data with each audio subdata to obtain the matching confidence coefficient of the image character data and each audio subdata, recording the corresponding relation between the picture and the audio subdata when the matching confidence coefficient of the image character data and any audio subdata reaches a preset second confidence coefficient threshold, and making the audio file and the picture file into an audio picture by using the corresponding relation between the picture and the audio subdata.
In the embodiment of the invention, the class-srt file is extracted from the audio file by speech recognition and is then matched against the image text information and image text data in the pictures, so that the audio is ultimately divided according to picture content. This helps reduce the work a content partner must do when submitting audio time points and their corresponding pictures, improves the accuracy of the submitted information, improves user experience, reduces the review workload, and improves review efficiency.
An embodiment of the present invention provides a data processing apparatus. Referring to fig. 7, fig. 7 is a schematic structural diagram of a first data processing apparatus according to an embodiment of the present invention. The apparatus comprises an acquisition module 710, a dividing module 720, a calculating module 730, a recording module 740 and a making module 750, wherein:
the acquisition module 710 is configured to acquire an audio file and a picture file for making an audio picture, where the audio file comprises at least one piece of audio data and the picture file comprises pictures; each piece of audio data comprises audio text information, and each picture comprises image text information;
a dividing module 720, configured to divide each piece of audio data according to audio text information in each piece of audio data, to obtain a plurality of pieces of audio subdata;
a calculating module 730, configured to calculate, for any one of the pictures, similarity between image text information in the picture and audio text information in each of the audio sub-data;
a recording module 740, configured to record the correspondence between any one of the pictures and any one of the audio sub-data if the similarity between the image text information in that picture and the audio text information in that audio sub-data is not smaller than a preset similarity threshold;
the making module 750 is configured to make the audio file and the picture file into an audio picture by using the correspondence between the picture and the audio sub-data.
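Illustratively, the making module's final step might be sketched as follows; the playlist output format and the field names are assumptions for illustration:

```python
# Sketch of the making step: pair each picture with the time span of its
# matched audio sub-data to drive synchronized audio-picture playback.
def make_audio_picture(correspondence, audio_subdata):
    """correspondence: {picture_id: audio_subdata_index}."""
    playlist = []
    for pic_id, idx in sorted(correspondence.items(),
                              key=lambda kv: audio_subdata[kv[1]]["timestamps"][0]):
        start, end = audio_subdata[idx]["timestamps"]
        playlist.append({"picture": pic_id, "start": start, "end": end})
    return playlist
```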
Referring to fig. 8, fig. 8 is a schematic structural diagram of a second data processing apparatus according to an embodiment of the present invention. In a possible implementation manner, the dividing module 720 includes:
an audio text information identification submodule 7201 configured to identify audio text information in the audio data by using a speech recognition technology to obtain text data in the audio data;
the first partitioning module 7202 is configured to perform semantic relationship recognition on the text data, and partition each audio data according to a semantic relationship recognition result to obtain a plurality of audio subdata.
Based on the embodiment shown in fig. 8, another data processing apparatus is provided in an embodiment of the present invention. Referring to fig. 9, fig. 9 is a schematic structural diagram of a third data processing apparatus provided in an embodiment of the present invention. In a possible implementation manner, the text data in the audio data includes a time stamp for each character, and the apparatus further includes:
the reading sub-module 7203 is configured to sequentially read each character in the text data according to the sequence of the characters in the text data;
a difference value calculating submodule 7204, configured to calculate the difference between the time stamps of adjacent characters and, if the difference is not smaller than a preset difference threshold, divide the adjacent characters into two different pieces of audio sub-data, where the character with the earlier time stamp goes into the preceding audio sub-data and the character with the later time stamp goes into the following audio sub-data (see the sketch after this list);
the second division submodule 7205 is configured to, if the difference is smaller than the preset difference threshold, divide the adjacent characters into the same audio sub data.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a fourth data processing apparatus according to an embodiment of the present invention. In a possible implementation manner, the calculating module 730 includes:
a first image character information identification sub-module 7301, configured to identify, for any one of the pictures, image character information in the picture based on character features in the picture;
a first matching sub-module 7302, configured to input the identified image text information and each audio sub-data into a pre-trained matching model in sequence, so as to obtain a matching confidence of the image text information and the audio text information in each audio sub-data;
the recording module 740 is specifically configured to:
when the matching confidence between the image text information and any one of the audio sub-data is not smaller than a preset first confidence threshold, record the correspondence between the picture and the audio sub-data.
In a possible embodiment, the above apparatus further comprises:
the second image text information identification submodule is configured to, if the image text information corresponding to the plurality of pictures is the same, recognize image text data in any picture of the picture file based on image features;
the second matching submodule is configured to match the recognized image text data against each piece of audio sub-data in turn to obtain the matching confidence between the image text data and each piece of audio sub-data;
the recording module 740 is specifically configured to:
when the matching confidence between the image text data and any one of the audio sub-data reaches a preset second confidence threshold, record the correspondence between the picture and the audio sub-data.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device includes a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, where the processor 1101, the communication interface 1102 and the memory 1103 communicate with each other through the communication bus 1104, and the memory 1103 is used for storing a computer program;
the processor 1101 is configured to implement the following steps when executing the program stored in the memory 1103:
acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio character information, and each picture comprises image character information;
dividing each audio data according to the audio character information in each audio data to obtain a plurality of audio subdata;
for any one of the pictures, respectively calculating the similarity between the image character information in the picture and the audio character information in each piece of audio subdata;
if the similarity between the image text information in any one of the pictures and the audio text information in any one of the audio subdata is not smaller than a preset similarity threshold, recording the corresponding relation between the picture and the audio subdata;
and making the audio file and the picture file into an audio picture by utilizing the corresponding relation between the picture and the audio subdata.
Optionally, the processor 1101 is configured to implement any one of the data processing methods described above when executing the program stored in the memory 1103.
The communication bus mentioned in the electronic device may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a RAM (Random Access Memory) or an NVM (Non-Volatile Memory), for example at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The Processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In an embodiment of the present application, a computer-readable storage medium is further provided, where instructions are stored in the storage medium, and when the instructions are executed on a computer, the instructions cause the computer to execute any one of the data processing methods in the foregoing embodiments.
In an embodiment of the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to execute any one of the data processing methods in the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center via a wired link (e.g., coaxial cable, optical fiber, DSL (Digital Subscriber Line)) or a wireless link (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD (Digital Versatile Disc)), or a semiconductor medium (e.g., an SSD (Solid State Drive)), etc.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, the electronic device, the computer-readable storage medium, and the computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method of data processing, the method comprising:
acquiring an audio file and a picture file for making an audio picture, wherein the audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio character information, and each picture comprises image character information;
according to the audio character information in each audio data, dividing each audio data to obtain a plurality of audio subdata;
for any one of the pictures, respectively calculating the similarity between the image character information in the picture and the audio character information in each piece of audio subdata;
if the similarity between the image character information in any one of the pictures and the audio character information in any one of the audio subdata is not smaller than a preset similarity threshold, recording the corresponding relation between the picture and the audio subdata;
and making the audio file and the picture file into an audio picture by utilizing the corresponding relation between the picture and the audio subdata.
2. The method of claim 1, wherein the dividing each of the audio data according to the audio text information in each of the audio data to obtain a plurality of audio subdata comprises:
recognizing audio character information in the audio data by utilizing a voice recognition technology to obtain character data in the audio data;
and performing semantic relation recognition on the character data, and dividing each audio data according to a semantic relation recognition result to obtain a plurality of audio subdata.
3. The method of claim 2, wherein the textual data in the audio data includes a time stamp for each character; after the step of recognizing the audio text information in the audio data by using the voice recognition technology to obtain the text data in the audio data, the method further includes:
reading each character in the character data in sequence according to the sequence of the characters in the character data;
calculating the difference value of the time stamps of adjacent characters, and if the difference value is not smaller than a preset difference value threshold, dividing the adjacent characters into two different audio subdata, wherein the character with the earlier time stamp is divided into the preceding audio subdata and the character with the later time stamp is divided into the following audio subdata;
and if the difference value is smaller than the preset difference value threshold value, dividing the adjacent characters into the same audio subdata.
4. The method of claim 1, wherein the calculating, for any one of the pictures, the similarity between the image character information in the picture and the audio character information in each piece of audio subdata comprises:
aiming at any one picture, identifying image character information in the picture based on character features in the picture;
inputting the recognized image character information and each audio subdata into a pre-trained matching model in sequence to obtain the matching confidence coefficient of the image character information and the audio character information in each audio subdata;
the recording, if the similarity between the image character information in any one of the pictures and the audio character information in any one of the audio subdata is not smaller than a preset similarity threshold, of the corresponding relation between the picture and the audio subdata comprises:
when the matching confidence between the image character information and any one of the audio subdata is not less than a preset first confidence threshold, recording the corresponding relation between the picture and the audio subdata.
5. The method of claim 4, wherein after identifying, for any of the pictures, image text information in the picture, the method further comprises:
if the image character information corresponding to the plurality of pictures is the same, identifying, for any picture in the picture file, the image character data in the picture based on the image characteristics;
matching the identified image character data with each piece of audio subdata in sequence to obtain the matching confidence between the image character data and each piece of audio subdata;
wherein the recording, when the matching confidence between the image character information and any one of the audio subdata is not smaller than the preset first confidence threshold, of the corresponding relation between the picture and the audio subdata comprises:
when the matching confidence between the image character data and any one of the audio subdata reaches a preset second confidence threshold, recording the corresponding relation between the picture and the audio subdata.
6. A data processing apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring an audio file and a picture file for making an audio picture, the audio file comprises at least one piece of audio data, and the picture file comprises a picture; each audio data comprises audio character information, and each picture comprises image character information;
the dividing module is used for dividing each audio data according to the audio character information in each audio data to obtain a plurality of audio subdata;
the calculation module is used for respectively calculating the similarity between the image text information in the picture and the audio text information in each piece of audio subdata aiming at any picture;
the recording module is used for recording the corresponding relation between the picture and the audio subdata if the similarity between the image text information in any picture and the audio text information in any audio subdata is not smaller than a preset similarity threshold;
and the making module is used for making the audio file and the picture file into an audio picture by utilizing the corresponding relation between the picture and the audio subdata.
7. The apparatus of claim 6, wherein the partitioning module comprises:
the audio character information identification submodule is used for identifying audio character information in the audio data by utilizing a voice identification technology to obtain character data in the audio data;
and the first dividing module is used for performing semantic relation identification on the character data and dividing each audio data according to a semantic relation identification result to obtain a plurality of audio subdata.
8. The apparatus of claim 7, wherein the textual data in the audio data comprises a time stamp for each character; the device further comprises:
the reading sub-module is used for sequentially reading each character in the character data according to the sequence of the characters in the character data;
the difference value calculation submodule is used for calculating the difference value of the time stamps of the adjacent characters, if the difference value is not smaller than a preset difference value threshold value, the adjacent characters are divided into two different audio subdata, wherein the characters with the early time stamps are divided into the former audio subdata, and the characters with the late time stamps are divided into the latter audio subdata;
and the second division submodule is used for dividing the adjacent characters into the same audio subdata if the difference value is smaller than the preset difference value threshold.
9. The apparatus of claim 6, wherein the computing module comprises:
the first image character information identification submodule is used for identifying the image character information in the picture based on character features in the picture aiming at any one picture;
the first matching submodule is used for inputting the identified image character information and each audio subdata into a pre-trained matching model in sequence to obtain the matching confidence coefficient of the image character information and the audio character information in each audio subdata;
the recording module is specifically configured to:
when the matching confidence between the image character information and any one of the audio subdata is not less than a preset first confidence threshold, record the corresponding relation between the picture and the audio subdata.
10. The apparatus of claim 9, further comprising:
the second image character information identification submodule is used for, if the image character information corresponding to the plurality of pictures is the same, identifying, for any picture in the picture file, the image character data in the picture based on the image characteristics;
the second matching submodule is used for matching the identified image character data with each piece of audio subdata in sequence to obtain the matching confidence between the image character data and each piece of audio subdata;
the recording module is specifically configured to:
when the matching confidence between the image character data and any one of the audio subdata reaches a preset second confidence threshold, record the corresponding relation between the picture and the audio subdata.
11. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor, when executing the computer program stored in the memory, implementing the method of any of claims 1-5.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 5.
CN202010826912.XA 2020-08-17 2020-08-17 Data processing method, device, electronic equipment and computer storage medium Active CN111966839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010826912.XA CN111966839B (en) 2020-08-17 2020-08-17 Data processing method, device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010826912.XA CN111966839B (en) 2020-08-17 2020-08-17 Data processing method, device, electronic equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN111966839A true CN111966839A (en) 2020-11-20
CN111966839B CN111966839B (en) 2023-07-25

Family

ID=73388209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010826912.XA Active CN111966839B (en) 2020-08-17 2020-08-17 Data processing method, device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN111966839B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006092381A (en) * 2004-09-27 2006-04-06 Hitachi Ltd Media mining method
US20150310107A1 (en) * 2014-04-24 2015-10-29 Shadi A. Alhakimi Video and audio content search engine
US20160293160A1 (en) * 2013-04-02 2016-10-06 Igal NIR Automatic Generation Of A Database For Speech Recognition From Video Captions
US20170133038A1 (en) * 2015-11-11 2017-05-11 Apptek, Inc. Method and apparatus for keyword speech recognition
CN107885430A (en) * 2017-11-07 2018-04-06 广东欧珀移动通信有限公司 A kind of audio frequency playing method, device, storage medium and electronic equipment
US20190005128A1 (en) * 2017-06-30 2019-01-03 Wipro Limited Method and system for generating a contextual audio related to an image
CN109743589A (en) * 2018-12-26 2019-05-10 百度在线网络技术(北京)有限公司 Article generation method and device
CN110110136A (en) * 2019-02-27 2019-08-09 咪咕数字传媒有限公司 A kind of text sound matching process, electronic equipment and storage medium
CN110297938A (en) * 2019-06-20 2019-10-01 北京奇艺世纪科技有限公司 A kind of audio frequency playing method, device and terminal
CN111010614A (en) * 2019-12-26 2020-04-14 北京奇艺世纪科技有限公司 Method, device, server and medium for displaying live caption
CN111091834A (en) * 2019-12-23 2020-05-01 科大讯飞股份有限公司 Text and audio alignment method and related product

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112702659A (en) * 2020-12-24 2021-04-23 成都新希望金融信息有限公司 Video subtitle processing method and device, electronic equipment and readable storage medium
CN112702659B (en) * 2020-12-24 2023-01-31 成都新希望金融信息有限公司 Video subtitle processing method and device, electronic equipment and readable storage medium
CN115691572A (en) * 2022-12-30 2023-02-03 北京语艺星光文化传媒有限公司 Audio multifunctional recording method and system based on intelligent content identification
CN115691572B (en) * 2022-12-30 2023-04-07 北京语艺星光文化传媒有限公司 Audio multifunctional recording method and system based on content intelligent identification

Also Published As

Publication number Publication date
CN111966839B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN108566565B (en) Bullet screen display method and device
CN109065031A (en) Voice annotation method, device and equipment
CN110781668B (en) Text information type identification method and device
CN111161739B (en) Speech recognition method and related product
CN108920648B (en) Cross-modal matching method based on music-image semantic relation
CN109949799B (en) Semantic parsing method and system
US10089898B2 (en) Information processing device, control method therefor, and computer program
CN105224581A (en) The method and apparatus of picture is presented when playing music
CN115050077A (en) Emotion recognition method, device, equipment and storage medium
CN111462553A (en) Language learning method and system based on video dubbing and sound correction training
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN111966839B (en) Data processing method, device, electronic equipment and computer storage medium
CN111107442A (en) Method and device for acquiring audio and video files, server and storage medium
CN112185363A (en) Audio processing method and device
CN108153875B (en) Corpus processing method and device, intelligent sound box and storage medium
CN114328817A (en) Text processing method and device
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
CN104882146A (en) Method and device for processing audio popularization information
CN110349567B (en) Speech signal recognition method and device, storage medium and electronic device
CN114363531B (en) H5-based text description video generation method, device, equipment and medium
CN115295020A (en) Voice evaluation method and device, electronic equipment and storage medium
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN114155841A (en) Voice recognition method, device, equipment and storage medium
CN114297372A (en) Personalized note generation method and system
CN110428668B (en) Data extraction method and device, computer system and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant