CN112151048B - Method for generating and processing audio-visual data - Google Patents

Method for generating and processing audio-visual data

Info

Publication number
CN112151048B
CN112151048B CN201910502799.7A
Authority
CN
China
Prior art keywords
data
audio
automatic identification
image
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910502799.7A
Other languages
Chinese (zh)
Other versions
CN112151048A (en)
Inventor
李庆成
鹿毅忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201910502799.7A priority Critical patent/CN112151048B/en
Publication of CN112151048A publication Critical patent/CN112151048A/en
Application granted granted Critical
Publication of CN112151048B publication Critical patent/CN112151048B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/167 Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/0021 Image watermarking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/16 Transforming into a non-visible representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/18 Details of the transformation process

Abstract

The invention provides a method for generating audio-visual data, comprising the following steps: converting quasi audio data and/or mark data to form corresponding first encoded data and/or second encoded data; converting the first encoded data and/or the second encoded data into corresponding first automatic identification marks and/or second automatic identification marks; and embedding the first automatic identification mark and/or the second automatic identification mark into a target picture as a digital watermark to form a picture-sound image. With this technical scheme, the resulting picture-sound image relieves the various inconveniences of existing sound-picture data during production and playback, while reducing the transmission volume of such data and the operating cost of transmitting it.

Description

Method for generating and processing audio-visual data
Technical Field
The invention relates to data processing technology, and in particular to a scheme for generating and processing audio-visual data; it belongs to the field of internet data processing technology.
Background
The audio-visual data is formed from at least any two of audio data, image data, text data, mark data and video data, together with other playback-control parameter data. See Chinese patent applications 20171000267.0, 201910004505.2 and 201910223774.3, which respectively disclose technical schemes for forming and applying such data.
From the above-mentioned Chinese patent applications it is known that the sound-picture data among such audio-visual data mainly comes in two types:
The first type of sound-picture data consists of one still picture and one piece of audio to be played together with that picture. The still picture is referred to throughout the present invention simply as the picture, and the audio as the picture audio. In addition, data the inventor calls alignment parameters are designed into the sound-picture data; according to their role, the alignment parameters are divided into picture alignment parameters and audio alignment parameters.
The second type of sound-picture data consists of several still pictures and several pieces of audio played in correspondence with them; these still pictures are likewise referred to herein as pictures, and the audio as picture audio. Since the pictures and pieces of audio are several, the alignment parameters in the sound-picture data are also several, their number corresponding to the number of pictures or pieces of audio; as in the first type, they are divided into picture alignment parameters and audio alignment parameters.
Sound-picture data is a complete data object and can be formed by splicing pictures, audio and information in any existing data format; in a specific scheme, skilled practitioners may also restructure it into an integrated data object with an entirely new format according to specific requirements.
A sound-picture consists of at least one picture and one piece of audio. Once sound-picture data has been generated, the pictures and audio within it are extracted at playback time and played according to the playback device's default mode or the playback-control parameters carried in the data, so that the viewer hears the audio information while viewing the picture. This is a new kind of data, and a scheme for playing it, invented by the inventor; together they give people a more convenient and effective means of transmitting information and communicating interactively.
However, since the aforementioned sound-picture data is composed of both pictures and audio, the following defects may arise in practical use:
1. In some situations it is inconvenient to record the corresponding audio while capturing the image. For example, the recorded speech may be unclear because the speaker's voice is in poor condition; or, although a portable smart device makes it easy to capture pictures (by photographing), the environment may make it impossible to speak, or to play, the voice or music to be recorded; and so on.
2. As noted above, because sound-picture data contains both pictures and audio, and the audio requires a certain amount of storage space to accommodate it, the data as a whole occupies more space, consumes more bandwidth during internet transmission, and costs more transmission traffic.
3. In some situations it is inconvenient to play the audio together with the picture (for example, the viewer is in a venue such as a conference hall where silence is expected), so the audio information is unavailable to the viewer, or must be processed with a speech-to-text technique similar to that in WeChat. This, however, demands extra effort from the viewer; and since speech recognition technology is still far from mature enough to convert speech to text without errors, it cannot present accurate information to the viewer.
Disclosure of Invention
An object of the present invention is to provide a method of generating audio-visual data by which encoded data can be used in place of the audio when such data is made, thereby relieving the various inconveniences of existing sound-picture data during production and playback. To distinguish this new data from existing sound-picture data, it is referred to in the following description as picture-sound data, and the resulting electronic picture as a picture-sound image.
Another object of the present invention is to provide a method for processing picture-sound data, by which the various data it carries can be conveniently processed and transformed upon receipt and made ready for playing.
The first object of the invention is achieved with the following technical scheme:
There is provided an audio-visual data generating method comprising: converting quasi audio data and/or mark data to form corresponding first encoded data and/or second encoded data; converting the first encoded data and/or the second encoded data into corresponding first automatic identification marks and/or second automatic identification marks; and embedding the first automatic identification mark and/or the second automatic identification mark into a target picture as a digital watermark to form a picture-sound image.
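As an illustration only, the following minimal sketch (Python) walks these three steps, assuming the third-party qrcode package for the automatic identification mark and plain least-significant-bit substitution in place of a production-grade watermark; every function and field name is the sketch's own, not the patent's.

    import json
    import numpy as np
    import qrcode
    from PIL import Image

    def make_picture_sound_image(target_path, quasi_audio, out_path):
        # Step 1: encode the quasi audio data under a predetermined rule
        # (here simply UTF-8 JSON) to obtain the "first encoded data".
        first_encoded = json.dumps(quasi_audio, ensure_ascii=False)

        # Step 2: convert the encoded data into an automatic identification
        # mark; a two-dimensional code is one of the mark types named above.
        mark = qrcode.make(first_encoded).get_image().convert("1")

        # Step 3: hide the mark in the target picture. A real system would
        # use a robust transform-domain watermark; LSB substitution in the
        # red channel stands in for it here only to keep the sketch short.
        cover = np.array(Image.open(target_path).convert("RGB"))
        bits = np.array(mark, dtype=np.uint8)        # 0/1 module pattern
        h, w = bits.shape
        assert h <= cover.shape[0] and w <= cover.shape[1], "picture too small"
        cover[:h, :w, 0] = (cover[:h, :w, 0] & 0xFE) | bits
        Image.fromarray(cover).save(out_path, format="PNG")

    make_picture_sound_image(
        "target.png",
        {"type": "tts_text", "text": "Welcome to the exhibition."},
        "picture_sound.png")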
With this scheme the quasi audio data and/or mark data are hidden inside the target picture, so that the resulting picture-sound image relieves the various inconveniences of existing sound-picture data during production and playback, while reducing the transmission volume of the data and the operating cost of transmitting it.
The other object of the invention is achieved with the following technical scheme: there is provided an audio-visual data processing method comprising: processing a picture-sound image and extracting a first automatic identification mark and/or a second automatic identification mark from it; reading and decoding the first automatic identification mark to obtain first encoded data, and/or reading and decoding the second automatic identification mark to obtain second encoded data; parsing the first encoded data according to a predetermined rule to obtain quasi audio data, and/or parsing the second encoded data according to a predetermined rule to obtain mark reproduction data and/or mark play parameters.
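A companion sketch of the processing side, under the same assumptions as the generation sketch above (red-channel LSB embedding, UTF-8 JSON rule) plus the third-party pyzbar decoder; the mark size passed in is illustrative.

    import json
    import numpy as np
    from PIL import Image
    from pyzbar.pyzbar import decode

    def read_picture_sound_image(path, mark_size):
        pixels = np.array(Image.open(path).convert("RGB"))
        # Extract the hidden automatic identification mark from the
        # red-channel LSB plane and rebuild it as a grayscale image.
        bits = pixels[:mark_size, :mark_size, 0] & 1
        mark = Image.fromarray((bits * 255).astype(np.uint8), mode="L")
        # Read and decode the mark to recover the first encoded data,
        # then parse it under the same predetermined rule (UTF-8 JSON).
        results = decode(mark)
        if not results:
            return None                  # no readable mark found
        return json.loads(results[0].data.decode("utf-8"))

    quasi_audio = read_picture_sound_image("picture_sound.png", mark_size=410)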
With this scheme, the quasi audio data and/or mark data steganographically written into a picture-sound image formed under the first technical scheme can be extracted, providing the data basis for playing the corresponding quasi audio data and/or mark data while the picture-sound image is displayed.
Below, the technical scheme of the present invention is disclosed in more detail in connection with the specific embodiments.
Drawings
FIG. 1 is a schematic diagram of the encoding device in embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the encoding device in embodiment 2 of the present invention;
FIG. 3 is a schematic diagram of the encoding device in embodiment 3 of the present invention;
FIG. 4 is a schematic diagram of the decoding device in embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of the decoding device in embodiment 2 of the present invention;
FIG. 6 is a schematic diagram of the decoding device in embodiment 3 of the present invention.
Detailed Description
Before describing the various embodiments of the present invention in detail, some of the data objects and terms involved are explained here for ease of reading. While researching and developing the technical schemes of the invention, the inventor systematically sorted the data objects involved and established and defined the following:
1. Picture-sound image, or picture-sound data: an electronic picture obtained by embedding quasi audio data, audio play parameters and other data into a target picture using a steganography technique, including watermarking.
2. Quasi audio data: an object composed of at least audio content data and/or audio play parameters. The audio content data may be a symbol sequence with a certain format that a speech synthesis tool can use to generate intelligent speech; or an audio content link address, from which either such a symbol sequence or directly playable audio content can be obtained.
3. Symbol sequence from which a speech synthesis tool can generate intelligent speech: a symbol sequence consisting of natural-language words and specific mark symbols, which artificial intelligence technology can convert into speech or speech signals.
4. Audio play parameters: the various parameters used by the playback device to control how audio is played.
5. Mark reproduction data: a data object that the playback device can use to display a specific mark, according to a certain rule or mode, while showing a picture-sound image and playing audio.
6. Mark play parameters: the various parameters that control how the playback device displays the mark reproduction data.
The class 1 embodiments of the invention mainly concern the technical scheme for generating the picture-sound image. They mainly comprise the following operations:
The quasi audio data is converted to form corresponding first encoded data.
The so-called quasi audio data here is not real audio data but a symbol sequence for generating audio content, usable by a speech synthesis tool to generate intelligent speech; or a link address from which such symbol sequences can be obtained; or a link address from which the audio content itself is directly acquired. In short, quasi audio data is not the audio content itself but mainly indirect data used to obtain, or be converted into, audio content. Its data volume is therefore small relative to actual audio, and it can conveniently be hidden inside a target picture by embedding. Three illustrative forms are sketched below.
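By way of illustration only, the three forms might be represented as follows (Python; the field names are assumptions of this sketch, not fixed by the invention):

    # symbol sequence usable by a speech synthesis tool
    tts_text   = {"type": "tts_text",
                  "text": "Hello, and welcome to today's lecture."}
    # link address from which such a symbol sequence is fetched
    tts_link   = {"type": "tts_link", "url": "https://example.com/script/42"}
    # link address from which directly playable audio content is fetched
    audio_link = {"type": "audio_link", "url": "https://example.com/clip.mp3"}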
The aforementioned quasi audio data must be encoded according to a predetermined rule or format to form the first encoded data, so that when the picture-sound image is later played, the first encoded data can be decoded under the same rule or format and the quasi audio data restored.
After the first encoded data is obtained, it is further converted to obtain a first automatic identification mark. An automatic identification mark here mainly means, for example, a bar code, a two-dimensional code or a specific code pattern.
Finally, the first automatic identification mark is used as a digital watermark pattern and hidden in a target picture by digital watermarking, finally yielding a picture-sound image carrying quasi audio data. Digital watermarking is a form of steganography: it conceals data within picture data by exploiting the masking effect of the human eye when viewing images, so that a person viewing an image in which data is hidden cannot perceive the hidden data.
When the picture-sound image is displayed, its appearance does not differ from that of the target picture, but the first automatic identification mark containing the quasi audio data is hidden inside it; the display device can read and decode this mark with specific software or a decoding circuit (chip) to obtain the quasi audio data. As described above, the quasi audio data comprises a symbol sequence from which a speech synthesis tool can generate intelligent speech, a link address for obtaining such a symbol sequence, or a link address for directly acquiring audio content; once obtained, the corresponding audio content can be acquired as follows:
For a symbol sequence from which a speech synthesis tool can generate intelligent speech, the synthesis tool converts the sequence into speech audio, which is played as the audio content together with the picture-sound image.
For a link address used to obtain such a symbol sequence, the sequence is first fetched by accessing the link address and then converted into speech audio by the synthesis tool, and the speech audio is played as the audio content together with the picture-sound image.
For a link address for directly acquiring audio content, the link address is accessed and the corresponding audio content obtained directly, to be played together with the picture-sound image. A dispatch over these three cases is sketched below.
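A hedged sketch of that dispatch, with stand-in stubs for the speech synthesis tool and the device player (both, like the field names, assumptions of the sketch):

    import urllib.request

    def fetch(url):
        # retrieve a symbol sequence or audio bytes from a link address
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def synthesize_speech(symbol_sequence):
        raise NotImplementedError("plug in a speech synthesis tool here")

    def play_audio(audio_bytes):
        raise NotImplementedError("plug in the device's audio output here")

    def play_quasi_audio(quasi_audio):
        kind = quasi_audio["type"]
        if kind == "tts_text":      # symbol sequence -> synthesized speech
            play_audio(synthesize_speech(quasi_audio["text"]))
        elif kind == "tts_link":    # fetch the symbol sequence, then synthesize
            play_audio(synthesize_speech(fetch(quasi_audio["url"]).decode("utf-8")))
        elif kind == "audio_link":  # fetch directly playable audio content
            play_audio(fetch(quasi_audio["url"]))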
In some cases it is desirable to include mark data in a picture-sound image. This is mainly so that the producer, when making the image, can mark what he wishes to emphasize or have the viewer focus on; such mark data typically describes the parameters of marking points, lines and tracks on the target picture.
To this end, in the class 1 embodiments the mark data may be converted into second encoded data; once obtained, the second encoded data may be merged with the first encoded data and converted together into a first automatic identification mark for steganographic writing into the target picture along with the first encoded data.
Of course, the second encoded data may instead be converted independently into a second automatic identification mark, which is hidden in the target picture on its own, regardless of whether a first automatic identification mark is present. The purpose is this: sometimes, when a picture-sound image is displayed, the audio content need not be played at the same time and only the mark pattern is shown. It is therefore necessary that a second automatic identification mark carrying the mark data can be generated and steganographically written into the target picture independently.
In the class 1 embodiments above, the first encoded data and/or second encoded data are steganographically written into the target picture as a digital watermark, so that the data volume of the picture-sound image remains substantially the same as that of the target picture despite the written data.
On the other hand, since the quasi audio data consists mainly of symbol sequences from which a speech synthesis tool can generate intelligent speech, link addresses for obtaining such sequences, or link addresses for directly acquiring audio content, playback can obtain the audio content either by applying artificial-intelligence speech synthesis to the directly or indirectly obtained symbol sequences, or by fetching it from the corresponding link address. The same playing effect as sound-picture data, namely a picture played together with audio, is thus achieved without increasing the data volume. Writing in the second encoded data additionally gives picture-sound playback the mark and track-drawing effects of a video program. And all of this comes at a lower data footprint: the picture-sound image saves both storage space and transmission traffic compared with sound-picture data. Taking the audio accompanying one picture as 30 seconds long (about 150 KB at a typical compressed bit rate of around 40 kbit/s), and with identical picture content, the picture-sound image needs at least 150 KB less data than the corresponding sound-picture program; for a playback tool such as a mobile phone, the saving amounts to more than 20% of the sound-picture data.
In yet another respect, because symbol sequences from which a speech synthesis tool generates intelligent speech are used, the speech information can be guaranteed to match the text content of the symbol sequence exactly, avoiding the poor accuracy of converting natural speech to text with speech recognition.
In some cases, for convenience of application, the first encoded data and the second encoded data may each contain one or more data types, giving in general the following four combinations:
one data type in each of the first and second encoded data; one data type in the first and several in the second; several in the first and one in the second; or several data types in each.
Accordingly, the class 2 embodiments of the invention further provide, on the basis of the class 1 embodiments, a scheme for converting the first encoded data into one or more first automatic identification marks according to the needs of the specific application, together with a scheme for converting the second encoded data into one or more second automatic identification marks in the same way.
The purpose is that, when playing the picture-sound image, the playback device can select some or all of the several first automatic identification marks to identify and decode. For example, based on rights settings, the device may select for decoding only the 2 rights-granted marks among 5 first automatic identification marks. As another example, when the user sets the device to play only direct audio content, it recognizes only those first automatic identification marks containing a link address for directly acquiring audio content and does not process the others.
Similarly, the playback device may select some or all of several second automatic identification marks to identify and decode. For example, several pieces of mark content may exist at different positions of the same picture; the corresponding mark data can be converted into several corresponding second automatic identification marks, and the pieces of mark content reproduced at different stages of playing the picture. At any specific moment, the device simply selects, from the several second automatic identification marks, the one corresponding to the current period for identification and decoding.
In short, the scheme of the class 2 embodiments makes the use of the data in a picture-sound image more flexible and convenient, suiting rich and varied playback applications.
Furthermore, depending on the automatic identification mark chosen, a single mark sometimes lacks the data space to hold all of the first encoded data and/or second encoded data, so several marks must be used to hold them. The precondition, however, is that the target picture has a sufficiently large watermark capacity.
Besides the audio content data, the first encoded data may include audio play parameters, which mainly tell the playback device how to play the audio content.
For example, when the audio content data is a symbol sequence from which a speech synthesis tool can generate intelligent speech, the audio play parameters may specify the speech rate, the voice type (male, female, child, etc.) and whether background sound is present; the playback device calls the corresponding software and hardware units according to these parameters.
As another example, when the audio content data is an audio content link address, the audio play parameters may include control parameters such as the play rate and the audio type; the playback device controls the playing speed of the audio content accordingly and, among the several audio contents pointed to by the link address, finds and downloads the one matching the audio type parameter.
As yet another example, different mark data may be reproduced with a variety of different accuracies, speeds and colors.
For this reason, the class 3 embodiments of the present invention provide the following schemes on the basis of the class 1 and class 2 embodiments:
the audio content data and/or audio play parameters are taken as the content of the quasi audio data and encoded as a whole into the first encoded data according to a predetermined protocol;
on the other hand, the mark data may be sampled at various frequencies to obtain the corresponding mark reproduction data and/or mark play parameters, which are then encoded as a whole into the second encoded data according to a predetermined protocol.
By analogy, besides the audio content or mark data themselves, parameters or data concerning the playing, transmission, decoding and decryption of the audio content and mark data can be incorporated as a whole into the first and second encoded data respectively, so that the class 3 embodiments can meet the application requirements of playing all kinds of picture-sound images. A packing sketch follows.
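A sketch of the class-3 packing step under an assumed JSON protocol, in which each side is encoded as a whole; the sampling interval, field names and parameter keys are all illustrative:

    import json

    def pack_first_encoded(audio_content, audio_play_params):
        # audio content data and audio play parameters, encoded as a whole
        return json.dumps({"v": 1, "audio": audio_content,
                           "params": audio_play_params})

    def pack_second_encoded(stroke_points, sample_every, mark_play_params):
        # down-sample the captured stroke to obtain mark reproduction data;
        # timing/color controls travel alongside as mark play parameters
        sampled = stroke_points[::sample_every]
        return json.dumps({"v": 1, "marks": sampled,
                           "params": mark_play_params})

    second_encoded = pack_second_encoded(
        stroke_points=[(12, 40, 0), (14, 41, 33), (17, 43, 66)],  # (x, y, ms)
        sample_every=2,
        mark_play_params={"color": "#ff0000", "width": 3})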
The class 4 embodiments, building on classes 1, 2 and 3, provide a technical scheme for encrypting key quasi audio data and/or mark data. The significance of encryption is self-evident and needs no elaboration. In the class 4 embodiments, encryption may be applied separately according to the content and type of the data concerned, or in several different modes respectively; the corresponding encrypted data are thereby isolated from one another, and security is higher.
The class 5 embodiments give a concrete hardware realization of the class 1 embodiments. Referring to FIG. 1, the encoding device 2 (specifically an integrated circuit, a separate chip, firmware, etc.) comprises an encoding unit 201, an automatic identification mark generating unit 202 and a steganography unit 203. The encoding unit 201 encodes the input quasi audio data 101 into first encoded data; the first encoded data is sent to the automatic identification mark generating unit 202, which transforms it into an automatic identification mark pattern 302; the steganography unit 203 steganographically writes the pattern 302 into the target picture 301, finally yielding the picture-sound image 401.
Referring to FIG. 2, in the class 6 embodiments the encoding unit 201 of the encoding device 2 may encode the mark data 102 in addition to the input quasi audio data 101 to generate the first encoded data. As before, the first encoded data is sent to the automatic identification mark generating unit 202, which transforms it into an automatic identification mark pattern 302; the steganography unit 203 writes the pattern into the target picture 301, yielding the picture-sound image 401.
Similarly to the class 5 and class 6 embodiments, in the class 7 embodiments the encoding unit 201 may convert the quasi audio data 101 and the mark data 102 into the first encoded data, or convert the mark data 102 separately into second encoded data. Whenever first and/or second encoded data is output by the encoding unit 201, the automatic identification mark generating unit 202 converts it correspondingly into an automatic identification mark pattern 302: given the first encoded data it generates the pattern corresponding to the first encoded data, and given the second encoded data the pattern corresponding to the second; that is, any encoded data is converted by the unit 202 into its corresponding automatic identification mark pattern 302 (see FIGS. 1 and 2).
The class 8 embodiments apply to the following situation: the encoding unit 201 encodes the mark data 102 alone into second encoded data, and the automatic identification mark generating unit 202 converts it alone into a second automatic identification mark pattern 302; the steganography unit 203 then writes this pattern into the target picture independently, regardless of whether a pattern corresponding to first encoded data exists.
Referring to FIG. 3, in the class 9 embodiments an encryption unit 201a is additionally arranged in front of the encoding unit 201 of the encoding device 2. Its function is to encrypt the quasi audio data 101 and/or mark data 102 according to the user's settings before the encoding unit 201 converts them into first and/or second encoded data, so as to increase their security where necessary. In practice the encryption unit 201a may instead be placed between the encoding unit 201 and the automatic identification mark generating unit 202, in which case the first and/or second encoded data are encrypted in their entirety; placing it before the encoding unit 201 allows finer-grained encryption of the quasi audio data 101 and/or mark data 102. For example, the encryption scheme can be chosen according to the specific type of the quasi audio data 101, or encryption can be applied separately according to different data attributes within the mark data 102. Both placements are sketched below.
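A sketch of the two placements of the encryption unit 201a, using symmetric encryption from the third-party cryptography package as a stand-in for whatever scheme the user configures:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # must be shared with the decoder
    cipher = Fernet(key)

    # Placement A (before the encoding unit 201): the raw quasi audio data
    # and/or mark data are encrypted first, allowing per-item choices of scheme.
    encrypted_quasi = cipher.encrypt(b'{"type": "tts_text", "text": "hi"}')

    # Placement B (between units 201 and 202): the finished first/second
    # encoded data are encrypted in their entirety.
    encrypted_first = cipher.encrypt(b"...first encoded data bytes...")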
The embodiments above mainly provide the technical scheme for generating the new audio-visual data, the picture-sound image. The invention further provides, through the following embodiments, various technical schemes for decoding picture-sound images.
The class 10 embodiments are the basic technical scheme for decoding the aforementioned picture-sound image. For a picture-sound image to be decoded, digital watermarking technology is first used to extract the steganographically written first automatic identification mark and/or second automatic identification mark from it.
It is well known that when data is steganographically written into an image by digital watermarking, or recovered from it, the data undergoes processing such as Fourier transforms, wavelet transforms and their inverses, and the transformed result is quantized; consequently, at the data-recovery stage the extracted data usually exhibits some changes for these reasons. That is why, in the embodiments above, the data to be hidden in the target picture must first be encoded and transformed into an automatic identification mark pattern.
The purpose is this: because the automatic identification mark pattern itself has definite pattern-structure rules, coding rules, fault tolerance and verification rules or algorithms, the data steganographically carried in the picture-sound image can still be restored completely and accurately even if it changes somewhat during the transforms, inverse transforms and quantization (manifested mainly as distortion of the steganographically written mark pattern), as the example below illustrates.
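For instance, a two-dimensional code generated at the highest error-correction level can still be decoded after roughly 30% of its modules are damaged, which is what lets the payload survive watermark-induced distortion (a sketch assuming the third-party qrcode package):

    import qrcode

    qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_H)
    qr.add_data("first encoded data")   # placeholder payload
    mark = qr.make_image()              # level H tolerates ~30% module damage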
Therefore, in the class 10 embodiments, on the basis of the first and/or second automatic identification mark extracted as described above, reading and decoding with the rules corresponding to each mark yields the first encoded data from the first automatic identification mark and the second encoded data from the second. Depending on the application, some picture-sound images may carry only a first automatic identification mark, others only a second, and still others both in the same image; in every case the marks present are read and decoded in the manner described. In the class 10 embodiments the first encoded data is carried by the first automatic identification mark and the second encoded data by the second.
As in the embodiments described above with reference to the drawings, the first and/or second automatic identification mark is read and decoded to obtain the first and/or second encoded data. The first encoded data thus obtained must then be parsed to recover the quasi audio data encoded within it; likewise, the second encoded data obtained must be parsed to recover the mark reproduction data and/or mark play parameters.
The class 10 embodiments are thus mainly a technical scheme for extracting the quasi audio data and/or mark reproduction data and similar information hidden in the picture-sound image by the embodiments above. After the quasi audio data and/or mark reproduction data and mark play parameters have been extracted, they may also need to be played on a playback device.
In the class 11 embodiments, the quasi audio data obtained above is first processed to obtain audio content data and/or audio play parameters. The audio content data may take several forms: a symbol sequence from which a speech synthesis tool can generate intelligent speech; a content link address from which such a symbol sequence can be obtained; or a link address for obtaining audio content. For these different forms, the corresponding playing operations are as follows:
When the audio content data is a symbol sequence from which a speech synthesis tool can generate intelligent speech, the picture-sound image is displayed according to the audio content data and the audio play parameters, and the symbol sequence is converted into audio content and played.
When the audio content data is a content link address pointing to such a symbol sequence, the playback device first obtains from that address the symbol sequence and the audio play parameters; thereafter, as in the case above, the picture-sound image is displayed and the symbol sequence is converted into audio content and played.
When the audio content data is an audio content link address, the corresponding audio content is obtained from that address, the picture-sound image is displayed, and the obtained audio content is played according to the audio play parameters. Unlike the previous two cases, here the link address points not to a symbol sequence for speech synthesis but to pre-recorded or pre-generated audio content, which can be played directly once obtained.
It should be noted that in any of the above cases the audio play parameters may be one or more of: parameters controlling the timing relation between audio playback and display of the picture-sound image; parameters controlling the speed, pauses and voice type of playback; and parameters for special-effect control in specific application scenarios. The audio play parameters may also be absent, in which case the playback device plays the corresponding audio content according to user-preset or default parameters. In some specific cases only audio play parameters are present in the audio data content; this typically concerns the case of merely reproducing mark content, the parameters then controlling the playing of the system audio that accompanies the reproduction of the mark content.
In some cases the second encoded data of the foregoing embodiments is carried in the first automatic identification mark. Correspondingly, in the class 12 embodiments the second encoded data can be obtained when the first automatic identification mark is read and decoded.
As described above, the second encoded data mainly carries the mark reproduction data and/or mark play parameters. The mark reproduction data is the data actually used, while a picture-sound image of the invention is displayed, to reproduce marks or tracks of points and lines on the image in accordance with the mark play parameters. For example, when a teacher explains the content of a picture-sound image on a display device with a touch screen, a series of mark patterns may be drawn on the touch screen stroke by stroke; these actually consist of a series of coordinate data associated with the image. The touch-screen device can record the time sequence and interval information of the input of each coordinate point, so that when the marked coordinate points are reproduced they follow the recorded sequence and intervals, and the marks reappear on the playback device just as the teacher originally drew and circled them.
The time sequence and interval information is in fact one kind of mark play parameter; in addition, the mark play parameters may include control parameters for the mark's color, the size of a coordinate point (the number of pixels), the thickness of the mark track, blinking, and so on. In short, the second encoded data may store the mark reproduction data alone, the mark play parameters alone, or both together. A replay sketch follows.
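A minimal replay sketch: the recorded coordinate points are reproduced with their original timing, while color and point size come from the mark play parameters; draw_point() is a stand-in for the playback device's renderer:

    import time

    def draw_point(x, y, color, radius):
        # stand-in for the playback device's screen-drawing routine
        print(f"draw ({x},{y}) color={color} r={radius}")

    def replay_marks(points, color="#ff0000", radius=3):
        # points: (x, y, t_ms) triples captured while the mark was drawn
        last_t = points[0][2]
        for x, y, t_ms in points:
            time.sleep(max(0, (t_ms - last_t) / 1000.0))
            last_t = t_ms
            draw_point(x, y, color=color, radius=radius)

    replay_marks([(12, 40, 0), (14, 41, 33), (17, 43, 66)])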
It is easy to understand storing the mark reproduction data alone in the second encoded data, but less obvious why the mark play parameters can be stored alone. In fact, in some referencing software the mark reproduction data may exist in the form of templates: when the playback device obtains the associated template number, the mark reproduction data can be generated automatically without being stored in the second encoded data. Since the mark reproduction data can be templated, the mark play parameters can naturally be stored alone in the second encoded data to reproduce the templated mark reproduction data individually.
After the second encoded data is obtained, the class 12 embodiments further parse it according to a predetermined rule to obtain the mark reproduction data and/or mark play parameters stored in it; then, based on these, the mark content corresponding to the mark reproduction data is displayed on the playback device.
In some specific application scenarios, several first automatic identification marks and several second automatic identification marks may be hidden in one target picture for specific purposes. For example, in a group-sent picture-sound image the user may wish to play or present different audio, marks and text content to different recipients; the several marks can then tell each recipient to play only the content encoded in the first and/or second automatic identification marks corresponding to that recipient.
For this reason, the class 13 embodiments require the following operation for the foregoing cases:
when several first automatic identification marks and/or second automatic identification marks are detected embedded in the picture-sound image, the several marks are read and decoded respectively.
Corresponding to the class 4 technical scheme of encrypting the quasi audio data and/or mark data, the class 14 embodiments further comprise: decrypting the quasi audio data and/or mark data when they are obtained.
The class 15 embodiments give a concrete hardware realization of the class 10 embodiments. Referring to FIG. 4, the decoding device 5 (specifically an integrated circuit, a separate chip, firmware, etc.) comprises a detection and extraction unit 501, a reading and decoding unit 502 and an encoding parsing unit 503. The detection and extraction unit 501 detects the picture-sound image 401 and extracts the automatic identification mark pattern 601 from it.
The automatic identification mark pattern 601 is sent to the reading and decoding unit 502 to be read and decoded into first encoded data (not shown); the first encoded data is then sent to the encoding parsing unit 503 for parsing, finally yielding the quasi audio data 101, which is used and/or stored in the subsequent application.
Referring to FIG. 5, in the class 16 embodiments, when detecting the picture-sound image 401 the detection and extraction unit 501 of the decoding device 5 may extract, besides the first automatic identification mark, a second automatic identification mark (whenever one exists in the image); indeed, the first and second automatic identification marks are each extracted by the detection and extraction unit 501 as long as they exist in the picture-sound image 401.
Like the first automatic identification mark, the second is also sent to the reading and decoding unit 502 to be read and decoded into second encoded data (not shown), which is further sent to the encoding parsing unit 503 and parsed to obtain the mark data 102.
The class 16 embodiments therefore refine the scheme disclosed in the class 15 embodiments: they can extract the first automatic identification mark and read and decode it, and can likewise extract the second automatic identification mark and read and decode it; parsing the encoded data obtained in each case then yields the quasi audio data 101 and the mark data 102 respectively.
Of course, in some cases the user places the mark data and the quasi audio data together in the first automatic identification mark for carrying; the class 16 embodiments therefore also cover the case in which the reading and decoding unit 502 obtains both the first and the second encoded data from decoding the first automatic identification mark, whereupon the encoding parsing unit 503 parses them to obtain the corresponding quasi audio data 101 and mark data 102.
Similarly to the class 15 and 16 embodiments, and corresponding to the class 9 embodiments, in the class 17 embodiments a decryption unit 503a is added between the reading and decoding unit 502 and the encoding parsing unit 503, matching the encryption unit arranged in the encoding device of the class 9 embodiments. The decryption unit 503a naturally decrypts the first and/or second encoded data encrypted by the encryption unit.

Claims (9)

1. A method of audio-visual data generation, comprising:
converting quasi audio data and/or mark data to form corresponding first encoded data and/or second encoded data;
converting the first encoded data and/or the second encoded data into corresponding first automatic identification marks and/or second automatic identification marks;
wherein the quasi audio data corresponds to the first encoded data and the mark data corresponds to the second encoded data, the first encoded data being converted into a corresponding first automatic identification mark and the second encoded data into a corresponding second automatic identification mark;
embedding the first automatic identification mark and/or the second automatic identification mark into a target picture as a digital watermark to form a picture-sound image.
2. The method according to claim 1, wherein said converting the first and/or second encoded data into corresponding first and/or second automatic identification marks comprises:
converting the first encoded data into one or more first automatic identification marks;
and/or
converting the second encoded data into one or more second automatic identification marks.
3. The method according to claim 1, wherein said forming the corresponding first encoded data and/or second encoded data comprises:
encoding audio content data and/or audio play parameters as a whole into the first encoded data according to a predetermined protocol;
and/or
sampling the collected mark data at a predetermined frequency to obtain mark reproduction data;
and encoding the mark reproduction data and/or mark play parameters as a whole into the second encoded data according to a predetermined protocol.
4. A method according to claim 1, 2 or 3, characterized in that:
before conversion, the quasi audio data and/or the mark data are further encrypted, and/or encrypted in several modes based on the type and/or kind of the data.
5. A method of audio-visual data processing, comprising:
processing a picture-sound image, and extracting a first automatic identification mark and/or a second automatic identification mark from it;
reading and decoding the first automatic identification mark to obtain first encoded data, and/or reading and decoding the second automatic identification mark to obtain second encoded data;
parsing the first encoded data according to a predetermined rule to obtain quasi audio data; and/or parsing the second encoded data according to a predetermined rule to obtain mark data.
6. The method as recited in claim 5, further comprising:
processing the quasi audio data to obtain audio content data and/or audio play parameters;
when the audio content data is a symbol sequence from which a speech synthesis tool can generate intelligent speech, displaying the picture-sound image according to the audio content data and/or the audio play parameters, and converting the symbol sequence into audio content and playing it;
or
when the audio content data is a content link address, obtaining from the address a symbol sequence from which a speech synthesis tool can generate intelligent speech and/or the audio play parameters, displaying the picture-sound image, and converting the symbol sequence into audio content and playing it;
or
when the audio content data is an audio content link address, displaying the picture-sound image according to the audio play parameters, and obtaining and playing the corresponding audio content based on the audio content link address.
7. The method according to claim 5 or 6, further comprising:
reading and decoding the first automatic identification mark to obtain second encoded data;
parsing the second encoded data according to a predetermined rule to obtain mark data;
and displaying the mark content corresponding to the mark reproduction data based on the mark reproduction data and/or mark play parameters in the mark data.
8. The method as recited in claim 5, further comprising:
processing the picture-sound image, and, when several first automatic identification marks and/or second automatic identification marks are detected embedded in it, reading and decoding the several first automatic identification marks and/or second automatic identification marks respectively.
9. The method according to claim 5 or 6, characterized in that:
when the quasi audio data and/or mark data are obtained, decryption processing is further performed on them.
CN201910502799.7A 2019-06-11 2019-06-11 Method for generating and processing audio-visual data Active CN112151048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910502799.7A CN112151048B (en) 2019-06-11 2019-06-11 Method for generating and processing audio-visual data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910502799.7A CN112151048B (en) 2019-06-11 2019-06-11 Method for generating and processing audio-visual data

Publications (2)

Publication Number Publication Date
CN112151048A CN112151048A (en) 2020-12-29
CN112151048B true CN112151048B (en) 2024-04-02

Family

ID=73868308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910502799.7A Active CN112151048B (en) 2019-06-11 2019-06-11 Method for generating and processing audio-visual data

Country Status (1)

Country Link
CN (1) CN112151048B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112863481B (en) * 2021-02-27 2023-11-03 腾讯音乐娱乐科技(深圳)有限公司 Audio generation method and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2745296Y (en) * 2004-06-01 2005-12-07 深圳矽感科技有限公司 Embedded digital camera equipment capable of generating voice bar code photograph
CN1918512A (en) * 2003-12-19 2007-02-21 创新科技有限公司 Still camera with audio decoding and coding, a printable audio format, and method
CN104916298A (en) * 2015-05-28 2015-09-16 努比亚技术有限公司 Coding and decoding methods, coding and decoding devices, electronic equipment and audio picture generating method
CN106022011A (en) * 2016-05-30 2016-10-12 合欢森林网络科技(北京)有限公司 Image-based confidential information spreading method, device and system
CN107295284A (en) * 2017-08-03 2017-10-24 浙江大学 A kind of generation of video file being made up of audio and picture and index playing method, device
CN109788161A (en) * 2018-12-18 2019-05-21 张亦茹 A kind of image processing method and system of hiding voice data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7475092B2 (en) * 2004-11-29 2009-01-06 Rothschild Trust Holdings, Llc System and method for embedding symbology in digital images and using the symbology to organize and control the digital images
JP4809840B2 (en) * 2005-08-04 2011-11-09 日本電信電話株式会社 Digital watermark embedding method, digital watermark embedding apparatus, and program

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1918512A (en) * 2003-12-19 2007-02-21 创新科技有限公司 Still camera with audio decoding and coding, a printable audio format, and method
CN2745296Y (en) * 2004-06-01 2005-12-07 深圳矽感科技有限公司 Embedded digital camera equipment capable of generating voice bar code photograph
CN104916298A (en) * 2015-05-28 2015-09-16 努比亚技术有限公司 Coding and decoding methods, coding and decoding devices, electronic equipment and audio picture generating method
CN106022011A (en) * 2016-05-30 2016-10-12 合欢森林网络科技(北京)有限公司 Image-based confidential information spreading method, device and system
CN107295284A (en) * 2017-08-03 2017-10-24 浙江大学 A kind of generation of video file being made up of audio and picture and index playing method, device
CN109788161A (en) * 2018-12-18 2019-05-21 张亦茹 A kind of image processing method and system of hiding voice data

Also Published As

Publication number Publication date
CN112151048A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
KR101255427B1 (en) Smart slate
US9009482B2 (en) Forensic marking using a common customization function
CN105612743A (en) Audio video playback synchronization for encoded media
US20100033484A1 (en) Personal-oriented multimedia studio platform apparatus and method for authorization 3d content
KR20150057591A (en) Method and apparatus for controlling playing video
WO2020244474A1 (en) Method, device and apparatus for adding and extracting video watermark
WO2016069016A1 (en) Object-based watermarking
KR100828479B1 (en) Apparatus and method for inserting addition data in image file on electronic device
CN113257259A (en) Secure audio watermarking based on neural networks
CN104065908A (en) Apparatus And Method For Creating And Reproducing Live Picture File
CN112151048B (en) Method for generating and processing audio-visual data
JP4087537B2 (en) Data processing apparatus and data recording medium
JP2010526514A (en) Movie-based investigation data for digital cinema
CN102158768B (en) MP4 file encapsulation format-based video authentication watermark embedding and extraction method
KR20200054978A (en) Encoding apparatus and method, decoding apparatus and method, and program
CN108550369A (en) A kind of panorama acoustical signal decoding method of variable-length
KR101018781B1 (en) Method and system for providing additional contents using augmented reality
CN102577413B (en) Method for adding voice content to video content and device for implementing said method
CN106792219B (en) It is a kind of that the method and device reviewed is broadcast live
CN102123327B (en) Method for embedding and extracting digital watermark on basis of streaming media noncritical frame
Suzuki et al. AnnoTone: Record-time audio watermarking for context-aware video editing
US20200151917A1 (en) Image reproduction device, information processing device, image reproduction method, and data structure of image data
WO2014053474A1 (en) Method and system for organising image recordings and sound recordings
KR100670443B1 (en) Computer recordable medium recording data file for sound/image syncronization and method for inserting image data sample into audio file
CN117596433B (en) International Chinese teaching audiovisual courseware editing system based on time axis fine adjustment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant