Creating .epub files in Python - Part 4: Creating the .epub file

What Will I Learn?

  • Create a .epub file in Python
  • Create files in Python
  • Use variables inside of strings

Requirements

  • Python version 3.6.x
  • The libraries: requests and beautifulsoup4
  • Calibre (optional)

Difficulty

This part is not as hard as I imagined it to be:

  • Basic/Intermediate

Tutorial Contents

Before this tutorial starts, a quick note: I will not explain the meaning and purpose of the strings inside the required files of a .epub file again, as I already did that in the last part. If you have any questions about why something is there or whether you can change it, check out the last part, The theory behind .epub files.
This will be the last part of this tutorial series. This time we will only talk about one function, which is responsible for putting together all the data we gathered and modified in the previous parts.
If you've read the last part, you'll know that a .epub file is nothing more than a zip file, so to create one we have to import another built-in library.

import zipfile

Now we will define the function inside of functions.py. As parameters we need to add everything that we might need, which is in this example:

  • A list of the HTML files we want to include (html_files)
  • The title of the novel (novelname)
  • The author (author)
  • The starting chapter passed as a string (chapter_s)
  • The ending chapter passed as a string (chapter_e)

What these parameters are used for, and why, will become clear later on.

def generate(html_files, novelname, author, chapter_s, chapter_e):

The first line of the function will create a Zip file with the variable name epub. In Python, we do this with zipfile.ZipFile(). Inside the brackets, we define the name of the Zip file as well as the mode we open it in ("w" for writing).

epub = zipfile.ZipFile(novelname + "_" + chapter_s + "-" + chapter_e + ".epub", "w")

Great! Now we have our empty epub file, the next step is to create the required files. Let's start with the easiest file to create, mimetype.

epub.writestr("mimetype", "application/epub+zip")

We have a number of ways to create files in Python; in this example we will use the writestr() method of the ZipFile object, which writes a string straight into the archive.
The file itself will be called mimetype, with no file extension and the file will contain the string "application/epub+zip".
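
As a minimal sketch of how writestr() behaves, the snippet below builds an archive in memory and reads the mimetype entry back. The io.BytesIO buffer stands in for the real .epub file, and the helper name build_skeleton is made up for this illustration. It also checks that the entry is stored uncompressed, which is zipfile's default (ZIP_STORED) and what the EPUB format expects for mimetype.

```python
import io
import zipfile


def build_skeleton():
    """Return the bytes of a zip archive that contains only the mimetype entry."""
    buffer = io.BytesIO()  # in-memory stand-in for the .epub file on disk
    with zipfile.ZipFile(buffer, "w") as epub:
        # writestr(name, data) creates an archive entry directly from a string
        epub.writestr("mimetype", "application/epub+zip")
    return buffer.getvalue()


# Re-open the archive for reading and inspect its first entry
archive = zipfile.ZipFile(io.BytesIO(build_skeleton()))
first = archive.infolist()[0]
print(first.filename)
print(first.compress_type == zipfile.ZIP_STORED)
print(archive.read("mimetype").decode("ascii"))
```

Writing mimetype before anything else, as the tutorial does, also keeps it as the first entry of the archive.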

The next file we create is the container.xml file. This file is the same in every ebook we create, like the mimetype file, so we'll create it in a similar fashion.

epub.writestr("META-INF/container.xml", '''<container version="1.0"
xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/Content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>''')

As you can see we created the file inside a folder called "META-INF" and referenced Content.opf, which will be located in a folder called "OEBPS". As for the meaning behind the rest of the string, I won't go into detail on that as it is out of the scope of this tutorial.
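
To illustrate what the reference to Content.opf is for, here is a small sketch that parses the same container.xml string with the standard library and extracts the path of the package file, roughly the way a reading application would locate it. The function name rootfile_path and the constant names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

CONTAINER_XML = '''<container version="1.0"
xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/Content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>'''

# The container file uses a default namespace, so lookups need a prefix mapping
NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def rootfile_path(container_xml):
    """Return the full-path attribute of the first rootfile element."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", NS)
    return rootfile.get("full-path")


print(rootfile_path(CONTAINER_XML))
```

If the path printed here does not match the name used when writing the package file into the archive, e-readers will not find the book's content.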

Now we come to the more complex files. In the last part, I explained the content and purpose of Content.opf, so check that out if you have any questions. What's important is that the file's content differs from ebook to ebook.
Basically, inside the metadata, manifest and spine tag we have to insert a string that we have to create depending on the content of the ebook.

index_tpl = '''<package version="3.1"
xmlns="http://www.idpf.org/2007/opf">
  <metadata>
    %(metadata)s
  </metadata>
  <manifest>
    %(manifest)s
  </manifest>
  <spine>
    <itemref idref="toc" linear="no"/>
    %(spine)s
  </spine>
</package>'''

This is what the string will look like. Everything inside %()s will be replaced by strings we will create depending on the content of the ebook. The only thing inside the spine that won't change is the itemref of the table of contents, so we can include it right away.
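
For readers who have not used this kind of placeholder before, here is a short, self-contained sketch of how %(name)s substitution works. The template and the values are made up for illustration and are not part of the tutorial's files.

```python
# Each %(key)s placeholder is replaced by the value stored under that key
template = "<title>%(title)s</title><creator>%(author)s</creator>"
values = {"title": "Example Novel: 1-10", "author": "Jane Doe"}

filled = template % values
print(filled)

# A literal percent sign inside such a template has to be written as %%
progress = "%(done)s%% finished" % {"done": 50}
print(progress)
```

The dictionary may be built anywhere before the substitution, which is why the tutorial can define the template first and collect the values afterwards.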

So let's fill in the blanks. For the manifest and spine we'll have to use a for loop to create the necessary strings. We'll do that later; for now, we just create two variables with empty strings.

manifest = ""
spine = ""

For the metadata we won't need a loop so let's create it right away:

metadata = '''<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">%(novelname)s</dc:title>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ns0="http://www.idpf.org/2007/opf" ns0:role="aut" ns0:file-as="NaN">%(author)s</dc:creator>
    <meta xmlns:dc="http://purl.org/dc/elements/1.1/" name="calibre:series" content="%(series)s"/>''' \
% {
    "novelname": novelname + ": " + chapter_s + "-" + chapter_e,
    "author": author,
    "series": novelname,
}

If you look through the code, you'll see that we have to fill in some parts for it to work. First, there is %(novelname)s. This will be the name displayed inside of e-readers. It does not have to match the name of the .epub file itself. In the lower part of the code block, I've built it from the novel name and the chapter range. You could change it to something different if you want. Then we add the author, which is the same as the author parameter passed to the function. The series is a custom piece of metadata that will be displayed in Calibre, a program for managing ebooks.

Before we create the for loop, we'll create a string that contains the manifest data for the table of contents. It will be added at the end of the manifest string.

toc_manifest = '<item href="toc.xhtml" id="toc" properties="nav" media-type="application/xhtml+xml"/>'

The for loop will look like this:

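# This loop relies on the os module, so functions.py has to import os.
for i, html in enumerate(html_files):
    basename = os.path.basename(html)  # file name without its folders
    # Register the chapter in the manifest and add it to the reading order.
    manifest += '<item id="file_%s" href="%s" media-type="application/xhtml+xml"/>' % (
        i + 1, basename)
    spine += '<itemref idref="file_%s" />' % (i + 1)
    # Copy the chapter file from disk into the OEBPS folder of the archive.
    epub.write(html, "OEBPS/" + basename)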

So what does it do? enumerate(html_files) yields pairs of an index and an item, so we go through the list of HTML files while i holds the zero-based number of the current iteration and html holds the current HTML file.
os.path.basename(html) gives us the file name without the folders in front of it. For the path '/foo/bar', it returns 'bar'.
In the next part, we will add a string to manifest every time we go through the for loop. The first %s will be the file number and the second is the link to the chapter. Since the chapter files are in the same folder as the content file, we can just use the basename.
In the spine, we will reference each chapter with the id.
In the last line of the loop, we will write the chapter into the .epub file. Up to now, they are located outside of it.
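
The following sketch runs the string-building part of the loop on two made-up file paths, so you can see what enumerate() and os.path.basename() produce. The paths are hypothetical, and the step that writes into the archive is left out to keep the example self-contained.

```python
import os

# Hypothetical chapter files; in the tutorial this list comes from main.py
html_files = ["chapters/ch1.xhtml", "chapters/ch2.xhtml"]

manifest = ""
spine = ""
for i, html in enumerate(html_files):
    basename = os.path.basename(html)  # "ch1.xhtml", then "ch2.xhtml"
    manifest += '<item id="file_%s" href="%s" media-type="application/xhtml+xml"/>' % (
        i + 1, basename)
    spine += '<itemref idref="file_%s" />' % (i + 1)

print(list(enumerate(html_files)))  # (index, item) pairs, index starts at 0
print(manifest)
print(spine)
```

Note that a path with a trailing slash, such as '/foo/bar/', gives an empty basename in Python, so the entries in html_files should point to files, not folders.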

We now have strings that contain all the relevant info for Content.opf, so we can write it.

    epub.writestr("OEBPS/Content.opf", index_tpl % {
              "metadata": metadata,
              "manifest": manifest + toc_manifest,
              "spine": spine, })

As you can see we use the gathered information from the for loop to fill in the blanks of the index_tpl string.

The last step is to create the Table of Contents. We will proceed the same way as before: first, we create a string with placeholders that we then fill in. As mentioned before, if you want to know what exactly the string means, check out the last part.
We will divide the toc file into three parts, start, mid and end.

    toc_start = '''<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
    <head>
        <title>%(novelname)s</title>
    </head>
        <body>
            <section class="frontmatter TableOfContents">
        <header>
            <h1>Contents</h1>
        </header>
            <nav id="toc" role="doc-toc" epub:type="toc">
                <ol>
                    %(toc_mid)s
                    %(toc_end)s'''
    toc_mid = ""
    toc_end = '''</ol></nav></section></body></html>'''

Before we can start filling in the blanks, we have to define a function that retrieves the chapter name, which we stored inside the title tag of each chapter. This is a very simple function that should be placed above the generate() function.

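def find_between(file):
    # Read a chapter file and return the text inside its <title> tag.
    # "with" closes the file again once it has been parsed.
    with open(file, "r", encoding="utf8") as f:
        soup = BeautifulSoup(f, 'html.parser')
    # .string gives the text only; soup.title alone would include the tag itself
    return soup.title.string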

Now add this for loop at the end of the generate() function:

    for i, html in enumerate(html_files):
        chapter = find_between(html)
        # Link to the basename, since toc.xhtml sits in the same folder as the chapters.
        toc_mid += '''<li class="toc-Chapter-rw" id="num_%s">
                   <a href="%s">%s</a>
                   </li>''' % (i, os.path.basename(html), chapter)

The loop follows the same general logic as the last one. We loop as often as there are chapters and append a string to a different one till we have everything we need.
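
To make the result of this loop easier to picture, here is a self-contained sketch that builds the list items from made-up chapter data. The function name build_toc_items and the sample titles are hypothetical. The call to html.escape() is an addition that is not in the tutorial's code: it is one way to keep characters such as & in a chapter title from producing invalid XHTML.

```python
import html
import os


def build_toc_items(chapters):
    """Build the <li> markup from a list of (file path, chapter title) pairs."""
    toc_mid = ""
    for i, (path, title) in enumerate(chapters):
        toc_mid += '<li class="toc-Chapter-rw" id="num_%s"><a href="%s">%s</a></li>' % (
            i, os.path.basename(path), html.escape(title))
    return toc_mid


# Made-up sample data standing in for the downloaded chapters
sample = [("ch1.xhtml", "Chapter 1"), ("ch2.xhtml", "Swords & Sorcery")]
toc_items = build_toc_items(sample)
print(toc_items)
```

The second title shows why escaping can matter: the raw ampersand is turned into an entity before it ends up in the table of contents.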

Now we only have to write the toc.xhtml file and close the epub file:

epub.writestr("OEBPS/toc.xhtml", toc_start % {"novelname": novelname, "toc_mid": toc_mid, "toc_end": toc_end})
epub.close()
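
If you want to check the finished file without opening an e-reader, a small sketch like the one below lists what ended up in the archive. The function name check_epub is an assumption, and the demo archive is a throwaway stand-in created in a temporary folder, not the real output of generate().

```python
import os
import tempfile
import zipfile


def check_epub(path):
    """Return the entry names of an .epub after checking the archive is intact."""
    with zipfile.ZipFile(path) as epub:
        bad = epub.testzip()  # name of the first corrupt entry, or None
        if bad is not None:
            raise ValueError("corrupt entry: " + bad)
        return epub.namelist()


# Build a tiny demo archive so the example can run on its own
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.epub")
with zipfile.ZipFile(demo_path, "w") as demo:
    demo.writestr("mimetype", "application/epub+zip")
    demo.writestr("META-INF/container.xml", "<container/>")
    demo.writestr("OEBPS/toc.xhtml", "<html/>")

entries = check_epub(demo_path)
print(entries)
```

For a file created by generate(), the list should contain mimetype, META-INF/container.xml, OEBPS/Content.opf, OEBPS/toc.xhtml and one entry per chapter.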

And delete all the files we don't need:

for x in html_files:
    os.remove(x)

And we're done with the functions.py file. The only thing left is to call the generate() function inside our main.py file. Save functions.py and open main.py.
In the second part of the tutorial, we created two loops. Just above the second for loop, create an empty list. It will hold the file names of the chapters, which we later pass as the html_files parameter of the generate() function.

file_list = []

To create the file list, just add this as the second-to-last line of the for loop:

file_list.append(info["ChapterName"] + str(name_counter) + ".xhtml")

As the last line of the main.py file, outside of the for loop, we'll call the generate() function:

functions.generate(file_list, info["NovelName"], info["author"], starting_chapter, ending_chapter)

And we're done! We successfully created a program that downloads chapters of a web novel available on the internet, formats the data so it is pretty to look at and creates a file that allows us to read them on an e-reader. I hope this was somewhat useful.

Here are the two files we created during this tutorial on Pastebin:
main.py
functions.py

Thanks for Reading! If you have any questions, feel free to let me know and I'll try my best to answer them.
