Creating .epub files in Python - Part 3: The theory behind .epub files

bloodviolet (46) in #utopian-io • 8 years ago (edited)

What Will I Learn?

In this tutorial, we will talk about the theory behind the .epub file:

How a .epub file is structured
The limitations of this file type
Required files inside of the .epub file

Requirements

This time nothing is required to follow this tutorial.

Difficulty

The theory behind this can be a little hard to understand:

Intermediate

Tutorial Contents

In the last two parts we talked about how to download files and format them, so we can use them in our .epub file.

This time we will talk about the .epub file in more detail. So what exactly is a .epub file and how do we create one?

A .epub file is nothing more than a .zip file with some mandatory files, so in theory it is pretty easy to create in Python. Don't be fooled, though: creating the .epub file will be by far the hardest part of this tutorial series.
Although the format is a well-documented standard, there are a lot of pitfalls to run into if you have neither prior experience nor this tutorial as a guide.
Part of the difficulty is that there are a lot of different e-readers out there, and Amazon's Kindle in particular is sometimes really hard to debug and to format for. In this tutorial we won't talk about adding CSS files, but writing one that works on every e-reader out there is a nearly impossible task if you don't have advanced knowledge of CSS and ebook formatting.

General Info

Even though the content files follow the XHTML 1.1 standard, we can't use forms, server-side image maps, intrinsic events or scripting, so keep that in mind.

For the encoding I will use UTF-8. UTF-16 is also allowed but may not work on some e-readers. Unless you really need it, I would advise against UTF-16, because it can increase the size of the files significantly: mostly-ASCII text takes roughly twice as many bytes before compression.
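
If you want to check the size difference between the two encodings yourself, a few lines of Python are enough. This is only a quick sketch: the sample string is made up for illustration, and real chapters will of course give different absolute numbers.

```python
# Compare how many bytes the same text needs in UTF-8 and in UTF-16.
# The sample is invented, mostly-ASCII markup, similar to a chapter file.
sample = "<p>Chapter 2 : Old Devil (2)</p>" * 100

utf8_size = len(sample.encode("utf-8"))
# Python's "utf-16" codec prepends a 2-byte byte order mark.
utf16_size = len(sample.encode("utf-16"))

print(f"UTF-8:  {utf8_size} bytes")
print(f"UTF-16: {utf16_size} bytes")
print(f"ratio:  {utf16_size / utf8_size:.2f}")
```

Keep in mind that the .epub container is compressed, so the difference in the finished file is usually smaller than the raw numbers suggest.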

If you want to use a .css file in the future, keep it simple and don't try anything fancy. Summarizing the limitations is close to impossible, so check out the OPS (Open Publication Structure) specification if you're interested.

If you want to use special fonts, go with OpenType fonts (.otf). They should work on most of the e-readers out there.

Now let's talk about the required files. In every .epub file you will find the files:

mimetype
container.xml
content.opf
toc.ncx

Mimetype

The mimetype file identifies what kind of file the reader is dealing with. In the case of a .epub file, its content looks like this:

application/epub+zip
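
One detail that is easy to miss: the EPUB container specification expects mimetype to be the first entry in the .zip archive and to be stored uncompressed. The following sketch checks this for an archive held in memory. The function name `has_valid_mimetype` is my own choice for illustration and not part of any library.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def has_valid_mimetype(epub_bytes: bytes) -> bool:
    """Return True if the archive starts with an uncompressed, correct mimetype entry.

    Hypothetical helper, written for this tutorial only.
    """
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        entries = archive.infolist()
        if not entries:
            return False
        first = entries[0]
        return (
            first.filename == "mimetype"
            and first.compress_type == zipfile.ZIP_STORED
            and archive.read(first) == EPUB_MIMETYPE
        )


# Build a tiny archive in memory, just to try the check out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", EPUB_MIMETYPE, compress_type=zipfile.ZIP_STORED)

print(has_valid_mimetype(buffer.getvalue()))
```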

Container.xml

container.xml is a container file, who would have guessed. It tells the e-reader where to find the .opf file that describes the book. The content of it looks like this:

<container version="1.0"
  xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
     <rootfile full-path="OEBPS/Content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>

The only thing you might need to change is the full-path attribute, in case you store Content.opf somewhere else.
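
Since only the path changes, container.xml is easy to generate from a small template. Here is a minimal sketch, assuming the default path from the example above. The function names `build_container_xml` and `read_opf_path` are hypothetical and only used for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def build_container_xml(opf_path: str = "OEBPS/Content.opf") -> str:
    """Return the text of META-INF/container.xml pointing at the given .opf file."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<container version="1.0" xmlns="{CONTAINER_NS}">\n'
        "  <rootfiles>\n"
        # quoteattr adds the surrounding quotes and escapes special characters.
        f"    <rootfile full-path={quoteattr(opf_path)} "
        'media-type="application/oebps-package+xml"/>\n'
        "  </rootfiles>\n"
        "</container>\n"
    )


def read_opf_path(container_xml: str) -> str:
    """Parse container.xml and return the path stored in full-path."""
    root = ET.fromstring(container_xml.encode("utf-8"))
    rootfile = root.find("c:rootfiles/c:rootfile", {"c": CONTAINER_NS})
    return rootfile.attrib["full-path"]


print(read_opf_path(build_container_xml()))
```

Reading the path back with `read_opf_path` is a cheap way to make sure the generated file is well-formed XML.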

Content.opf

The content.opf file is a little bit more complicated. It contains metadata like the title, author, language, subject, description, publisher, rights and publication date of the book (and more). It can also contain custom metadata. This is useful if you specifically want to target Calibre, because you can add things like the series the book belongs to. The metadata part of content.opf should also hold a unique identifier for the ebook: the specification requires one, although the example further down leaves it out.
The content.opf file also contains the manifest, which holds the link to each XHTML file together with its id. This is important so that the e-reader knows which files make up the ebook; we also need to reference them later in the toc.ncx file. Basically, the manifest lists every file that is part of the book.

The manifest holds not only the location of the XHTML files but also the link to the cover (if you add one), the CSS file (in case you want it to apply to the whole book), a logo, the specs of the file (I won't go into detail on that since it's pretty advanced) and the location of the table of contents file (toc.ncx). The order is not relevant.

After the manifest comes the spine, which defines the reading order. It references the content documents from the manifest by their id, with no duplicates. All files must be part of the .epub file, so you can't reference anything outside of it. Here the order is significant.
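
The two spine rules (every entry must point to a manifest item, and no entry may appear twice) are easy to check with a few lines of Python. This is a rough sketch and not a full validator. `check_spine` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def check_spine(opf_text: str) -> list[str]:
    """Return a list of problems found in the spine of a content.opf document."""
    root = ET.fromstring(opf_text.encode("utf-8"))
    manifest_ids = {
        item.get("id") for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    problems = []
    seen = set()
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        idref = itemref.get("idref")
        if idref not in manifest_ids:
            problems.append(f"spine references unknown id: {idref}")
        if idref in seen:
            problems.append(f"spine lists id more than once: {idref}")
        seen.add(idref)
    return problems


# Deliberately broken sample: file_2 is listed twice, file_9 does not exist.
sample_opf = """<package version="3.1" xmlns="http://www.idpf.org/2007/opf">
  <manifest>
    <item id="file_1" href="file1.xhtml" media-type="application/xhtml+xml"/>
    <item id="file_2" href="file2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="file_1"/>
    <itemref idref="file_2"/>
    <itemref idref="file_2"/>
    <itemref idref="file_9"/>
  </spine>
</package>"""

for problem in check_spine(sample_opf):
    print(problem)
```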

After the spine, you could add the guide, but this is optional so let's just skip that.

Here is an example content.opf file from the finished program:

<package version="3.1"
  xmlns="http://www.idpf.org/2007/opf">
  <metadata>
    <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Emperor's Domination: 2-4</dc:title>
    <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ns0="http://www.idpf.org/2007/opf" ns0:role="aut" ns0:file-as="Unbekannt">Yan Bi Xiao Sheng, Bao</dc:creator>
    <meta xmlns:dc="http://purl.org/dc/elements/1.1/" name="calibre:series" content="Emperor's Domination"/>
  </metadata>
  <manifest>
    <item id="file_1" href="emperor-chapter-2.xhtml" media-type="application/xhtml+xml"/>
    <item id="file_2" href="emperor-chapter-3.xhtml" media-type="application/xhtml+xml"/>
    <item id="file_3" href="emperor-chapter-4.xhtml" media-type="application/xhtml+xml"/>
    <item href="toc.xhtml" id="toc" properties="nav" media-type="application/xhtml+xml"/>
    <item href="cover.jpg" id="cover" media-type="image/jpeg" properties="cover-image"/>
  </manifest>
  <spine>
    <itemref idref="toc" linear="no"/>
    <itemref idref="file_1"/>
    <itemref idref="file_2"/>
    <itemref idref="file_3"/>
  </spine>
</package>
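
To show how the manifest and the spine relate to a plain list of chapter files, here is a rough sketch that builds a stripped-down content.opf as a string. It is not the code of the finished program. The function name `build_content_opf` and its parameters are my own assumptions, and the cover, the Calibre series entry and any further metadata are left out to keep it short.

```python
from xml.sax.saxutils import escape, quoteattr


def build_content_opf(title: str, author: str, chapter_files: list[str]) -> str:
    """Build a minimal content.opf for the given chapter files (in reading order)."""
    manifest = []
    # The table of contents goes first in the spine, as in the example above.
    spine = ['<itemref idref="toc" linear="no"/>']
    for number, href in enumerate(chapter_files, start=1):
        item_id = f"file_{number}"
        manifest.append(
            f'<item id="{item_id}" href={quoteattr(href)} '
            'media-type="application/xhtml+xml"/>'
        )
        spine.append(f'<itemref idref="{item_id}"/>')
    manifest.append(
        '<item id="toc" href="toc.xhtml" properties="nav" '
        'media-type="application/xhtml+xml"/>'
    )
    manifest_xml = "\n    ".join(manifest)
    spine_xml = "\n    ".join(spine)
    return (
        '<package version="3.1" xmlns="http://www.idpf.org/2007/opf" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
        "  <metadata>\n"
        f"    <dc:title>{escape(title)}</dc:title>\n"
        f"    <dc:creator>{escape(author)}</dc:creator>\n"
        "  </metadata>\n"
        f"  <manifest>\n    {manifest_xml}\n  </manifest>\n"
        f"  <spine>\n    {spine_xml}\n  </spine>\n"
        "</package>\n"
    )


print(build_content_opf(
    "Emperor's Domination: 2-4",
    "Yan Bi Xiao Sheng",
    ["emperor-chapter-2.xhtml", "emperor-chapter-3.xhtml", "emperor-chapter-4.xhtml"],
))
```

Every chapter gets an id in the manifest, and the spine then lists those ids in reading order.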

Table of Contents

The last required file is toc.ncx, which contains the table of contents. Its purpose is to make the book easy to navigate on an e-reader.
Although I referred to the file as toc.ncx, in the finished program we will have a toc.xhtml file. When I initially wrote the program I had trouble debugging the .ncx file and used the familiar structure of HTML files to make it a little easier. This also matches EPUB 3, which uses an XHTML navigation document for the table of contents, while toc.ncx is the EPUB 2 format. Some e-readers will probably have problems with it, but almost all of them will accept the file.
In toc.ncx the entries are called navPoints, and each one refers to a specific chapter; in toc.xhtml the list items play the same role. An entry can also contain a list of further entries, so you can nest them. This can be useful if you have a complicated structure or a lot of content. In this tutorial, we won't do that.

Here is an example toc.xhtml file of the finished program:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
    <title>Emperor's Domination</title>
</head>
<body>
    <section class="frontmatter TableOfContents">
        <header>
            <h1>Contents</h1>
        </header>
        <nav id="toc" role="doc-toc" epub:type="toc">
            <ol>
                <li class="toc-Chapter-rw" id="num_0">
                    <a href="emperor-chapter-2.xhtml">Chapter 2 : Old Devil (2)</a>
                </li>
                <li class="toc-Chapter-rw" id="num_1">
                    <a href="emperor-chapter-3.xhtml">Chapter 3 : Cleansing Incense Ancient Sect (1)</a>
                </li>
                <li class="toc-Chapter-rw" id="num_2">
                    <a href="emperor-chapter-4.xhtml">Chapter 4 : Cleansing Incense Ancient Sect (2)</a>
                </li>
            </ol>
        </nav>
    </section>
</body>
</html>
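
The list inside the nav element is the only part that changes from book to book, so it can be generated from (file, title) pairs. The following is a simplified sketch under that assumption. `build_toc_xhtml` is a hypothetical name, and the extra section, header and class attributes of the example above are left out.

```python
from xml.sax.saxutils import escape, quoteattr


def build_toc_xhtml(book_title: str, chapters: list[tuple[str, str]]) -> str:
    """Build a minimal toc.xhtml from (href, chapter title) pairs in reading order."""
    entries = "\n".join(
        f"      <li><a href={quoteattr(href)}>{escape(title)}</a></li>"
        for href, title in chapters
    )
    return (
        "<?xml version='1.0' encoding='utf-8'?>\n"
        "<!DOCTYPE html>\n"
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">\n'
        f"<head><title>{escape(book_title)}</title></head>\n"
        "<body>\n"
        '  <nav id="toc" epub:type="toc">\n'
        "    <h1>Contents</h1>\n"
        "    <ol>\n"
        f"{entries}\n"
        "    </ol>\n"
        "  </nav>\n"
        "</body>\n"
        "</html>\n"
    )


print(build_toc_xhtml(
    "Emperor's Domination",
    [
        ("emperor-chapter-2.xhtml", "Chapter 2 : Old Devil (2)"),
        ("emperor-chapter-3.xhtml", "Chapter 3 : Cleansing Incense Ancient Sect (1)"),
    ],
))
```

The `escape` and `quoteattr` calls make sure that characters like & or < in a chapter title do not break the XML.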

With all the files in place the folder structure will, in the end, look like this:

ebook
    META-INF
        container.xml
    OEBPS
        Content.opf
        cover.jpg
        file1.xhtml
        file2.xhtml
        ...
        toc.xhtml
    mimetype

This is the recommended folder structure. You could use a different one for the OEBPS part as long as you adjust the paths; mimetype and META-INF/container.xml have to stay where they are.
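
Turning such a folder into a .epub file is then mostly a matter of zipping it. Here is a minimal sketch using Python's zipfile module, assuming the folder layout shown above. `pack_epub` is a hypothetical helper and not the code of the finished program. The one special case is mimetype, which the EPUB container specification expects as the first entry and uncompressed.

```python
import zipfile
from pathlib import Path


def pack_epub(source_dir: str, epub_path: str) -> None:
    """Zip the contents of source_dir into epub_path.

    epub_path should point outside of source_dir, otherwise the archive
    would try to include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # mimetype goes in first and is stored without compression.
        archive.write(source / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            # Archive names are relative to the folder and use forward slashes.
            name = path.relative_to(source).as_posix()
            if name == "mimetype":
                continue  # already written above
            archive.write(path, name, compress_type=zipfile.ZIP_DEFLATED)
```

With the structure above you would call it as `pack_epub("ebook", "book.epub")`.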

I hope you now have a good understanding of how a .epub file is structured. Even if you did not understand everything (or anything at all), you'll still be able to complete the tutorial series.

In the next part, we will take the knowledge acquired today and translate it into Python code for our program to use.

Curriculum

Here are the links to the previous two parts of this tutorial series:

Posted on Utopian.io - Rewarding Open Source Contributors
