Creating .epub files in Python - Part 1: Getting the data
What Will I Learn?
In this tutorial, I will explain how you can create .epub files from data you scraped from the internet with Python. I will explain how the linked program works while keeping it as general as possible so you can apply my solution to your problems.
The key concepts you can learn following this tutorial series are:
- How to get HTML documents from the internet
- How to get the information you want from the HTML document
- How to create an ebook (.epub) without the help of external programs or libraries
- The basic structure of .epub files
- How to debug .epub files
Requirements
For this tutorial you will need:
- Python version 3.6.x
- The libraries: requests and beautifulsoup4
Difficulty
I will try to keep it as simple as possible. If you have more advanced knowledge of Python, you can just skip the parts you already know. My goal is that you can still follow this tutorial even if you have close to no knowledge of Python. So the difficulty is:
- Basic
Tutorial Contents
When I started writing the program I was a little lost since there was no in-depth tutorial on creating epub files in Python, so I decided to make my own.
This tutorial will be a multi-part series, so I'll be able to make it as detailed as possible in case somebody with limited coding experience wants to create a similar program.
In the first part of this tutorial, we will take a look at how to get the data we want to put into the .epub file. This includes downloading the HTML code of the web pages and saving it in HTML files.
For this tutorial, we will work with the website wuxiaworld.com, since they have a good link structure that will make it easy to write a useful program. To keep it as simple as possible we will only focus on one novel available on the website.
At the end of this tutorial series, we will have a program that generates a .epub file of a specific range of chapters from the novel "Emperor's Domination".
Before we start, make sure you have Python installed and the necessary libraries that are listed in the requirements part. Since there are tons of explanations on how to install libraries in Python, I won't go into detail on how to do it.
The first thing you want to do is create a new text file (I named it "main.py").
Before we write our first lines of code, we want to check how the link structure looks on the site. If you open http://www.wuxiaworld.com/emperor-index/ you'll see the index of the novel. If you click on a random chapter, you'll see that the link is nothing more than a base link + the chapter number. This makes it as easy as it gets to automatically download a specific range of chapters.
So to start things off, we want to create a dictionary with information we might need. In this example, I decided on 4 entries in the dictionary.
- The base link
- The base name for the XHTML files we're going to generate
- The name of the Novel for the .epub file
- The name of the author and the translator
In theory, you could just hard code this information since we're only working with one novel, but if you want to add support for new novels, it makes more sense to work with variables.
So to start things off, we're going to create a dictionary with the name "info" that contains the aforementioned entries:
info = {"link" : "http://www.wuxiaworld.com/emperor-index/emperor-chapter-",
        "ChapterName" : "emperor-chapter-",
        "NovelName" : "Emperor's Domination",
        "author" : "Yan Bi Xiao Sheng, Bao"}
If you want you could add more entries like a link to a cover but for this tutorial, I'll limit myself to those four.
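As a small illustration of how these entries are read back later, the sketch below looks up values by their key. The variable names "first_chapter_link" and "example_file_name" and the ".xhtml" ending are assumptions made for this example only and are not part of the program itself.

```python
# The same dictionary as in the tutorial.
info = {
    "link": "http://www.wuxiaworld.com/emperor-index/emperor-chapter-",
    "ChapterName": "emperor-chapter-",
    "NovelName": "Emperor's Domination",
    "author": "Yan Bi Xiao Sheng, Bao",
}

# A value is read by writing its key in square brackets.
first_chapter_link = info["link"] + "1"

# Hypothetical file name, only to show how the base name could be reused.
example_file_name = info["ChapterName"] + "1" + ".xhtml"

print(first_chapter_link)
print(example_file_name)
print(info["NovelName"] + " by " + info["author"])
```

Supporting another novel would then only mean swapping the values in the dictionary, while the code that reads them stays the same.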
Since we now have the information to work with we're going to need some sort of input to know which chapters the user wants. In Python, we do this with the input() function.
Since we want to store the input, we create two variables named "starting_chapter" and "ending_chapter" that hold what the user typed. Inside the brackets of input(), we can write a string so the user knows what to enter.
starting_chapter = input("What chapter do you want to start at?: ")
ending_chapter = input("Till what chapter do you want to read?: ")
Great! Now we have some information for the program to work with.
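Keep in mind that input() always returns a string, even if the user types a number. The following is a minimal sketch of how the two answers could be converted and checked before they are used. The helper name "parse_chapter_range" is hypothetical and not part of the tutorial's program, and the strings are passed in directly so the example runs without typing anything.

```python
def parse_chapter_range(start_text, end_text):
    """Convert the two input strings to integers and check their order."""
    start = int(start_text)  # raises ValueError for text such as "abc"
    end = int(end_text)
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start, end


# In main.py, the two arguments would be the values returned by input().
print(parse_chapter_range("3", "5"))

try:
    parse_chapter_range("5", "3")
except ValueError as error:
    print("Invalid range:", error)
```

Checking the values this early means a typo is reported right away instead of causing an error halfway through the downloads.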
Now it's time to define a function that will download the selected chapters.
To keep everything nice and clean, let us create a new text file with the name "functions.py" that is located in the same folder as "main.py".
Open the file and in the very first line add the following line:
import requests
This will import the requests library into our project.
Since we can now use requests, we want to define a function with the parameters "link" and "file_name". Inside the download function, we want to do the following:
- Create a variable that stores the HTML code of the website
- Create a file that has the same name as the "file_name" parameter
- Write the HTML code to the file we've created
- Close the file
If we translate the steps into Python code, it'll look something like this:
def download(link, file_name):
    page = requests.get(link).text  # HTML code of the page as a string
    file = open(file_name, "w", encoding="utf8")  # create (or overwrite) the file
    file.write(page)
    file.close()
In the first line, we created a function "download" with the two parameters.
In the following line, we create a variable called "page" that holds the webpage's HTML code. requests.get(link) sends the request and returns a response object. The .text at the end gives us the body of that response as a string, which in our case is the HTML code.
In the next line, we open a file whose name is the value of the "file_name" parameter. "w" opens the file for writing, which creates the file if it doesn't exist and overwrites it if it does. encoding="utf8" makes Python write the text as UTF-8 instead of the platform's default encoding, which prevents encoding errors for characters the default encoding can't represent.
Then we write the content of the "page" variable to the file we've created.
At the very end of the function, we close the file. Python will usually close the file on its own once the function has finished and the file object is cleaned up, but that is not guaranteed, so closing it explicitly makes sure everything is written to disk.
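A common alternative to calling close() yourself is the with statement, which closes the file as soon as the indented block ends, even if an error occurs while writing. The sketch below only shows the file-writing part. The helper name "save_text", the sample HTML string and the temporary folder are assumptions for this example, so it runs without an internet connection.

```python
import os
import tempfile


def save_text(text, file_name):
    # The with block closes the file when it ends, so no close() call is needed.
    with open(file_name, "w", encoding="utf8") as file:
        file.write(text)
    # After the block, the file object reports that it is closed.
    return file.closed


# Write into a temporary folder so no files are left behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.html")
    was_closed = save_text("<html><body>Chapter text</body></html>", path)
    with open(path, encoding="utf8") as file:
        saved = file.read()

print(was_closed)
print(saved)
```

Inside the download function, the same pattern would replace the open(), write() and close() lines.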
If we would run the program now, nothing would happen because we don't call the download function. Since that is a bad ending for the first part of this tutorial series, let's fix this real quick.
Let's open "main.py" again.
Since we wrote the download function in a new file, we have to import "functions.py" into "main.py". So add the following as the first line of main.py:
import functions
Now, main.py should look something like this:
import functions
info = {"link" : "http://www.wuxiaworld.com/emperor-index/emperor-chapter-", "ChapterName" : "emperor-chapter-", "NovelName" : "Emperor's Domination", "author" : "Yan Bi Xiao Sheng, Bao"}
starting_chapter = input("What chapter do you want to start at?: ")
ending_chapter = input("Till what chapter do you want to read?: ")
If you look at the download function we wrote earlier, you'll see that it requires two parameters: "link" and "file_name".
Because we will delete the files later on, their names don't matter much, so we can simply use a number as the "file_name".
Let's create the links.
For later use, we want to store all the links in a list, so the first step is to create an empty list with the name "link_list".
link_list = []
So far so good. The next step is to create a for loop that will count, starting from "starting_chapter" till "ending_chapter".
We want to add the number the for loop is currently at to the end of the "link" entry from the "info" dictionary. Afterwards, the generated link should be added to the end of the "link_list" list. In Python it looks like this:
for s in range(int(starting_chapter), int(ending_chapter) + 1):
    link_list.append(info["link"] + str(s))
If you look at the code you'll see that in the first line, we've converted the input, which is stored as a string, to an integer (number).
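The range() function produces the numbers the for loop counts through. It stops one before its second argument, which is why we add 1 to "ending_chapter".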
The .append() will add the content of the brackets to "link_list".
"s" is the number we're currently at.
Theoretically, you could just call the download function inside the for loop we just wrote, but to visualize how the program works we'll write two separate loops.
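To see what the link-building loop produces, here is a sketch with fixed example values in place of the user's input. The values "3" and "5" are assumptions for this example, and they are kept as strings because that is what input() would return.

```python
info = {"link": "http://www.wuxiaworld.com/emperor-index/emperor-chapter-"}

# Example values standing in for what the user would type.
starting_chapter = "3"
ending_chapter = "5"

link_list = []
# range(3, 6) yields 3, 4 and 5, so the "+ 1" makes the last chapter inclusive.
for s in range(int(starting_chapter), int(ending_chapter) + 1):
    link_list.append(info["link"] + str(s))

for link in link_list:
    print(link)
```

This prints one link per chapter, ending in "emperor-chapter-3", "emperor-chapter-4" and "emperor-chapter-5".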
Let's create a second for loop that will call the download function as often as the length of the "link_list".
for x in range(len(link_list)):
    functions.download(link_list[x], str(x) + ".html")
Let me explain the second line. First, we refer to the download function in the functions file with "functions.download".
Inside the brackets, we pass the values for "link" and "file_name" to the download function. The link is stored at index x inside link_list. As the file name, we use x followed by ".html", so the first downloaded chapter is saved as "0.html", the second as "1.html", and so on.
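Note that x in the loop above is the position inside link_list, not the chapter number. The sketch below compares the file names the loop produces with names based on the chapter number. It only builds the names and downloads nothing. The variables "rows", "index_name" and "chapter_name" and the example range 3 to 5 are assumptions for this illustration.

```python
base_link = "http://www.wuxiaworld.com/emperor-index/emperor-chapter-"
starting_chapter = 3  # example values, already converted to int
ending_chapter = 5

# Same result as the for loop with .append(), written as a list comprehension.
link_list = [base_link + str(n) for n in range(starting_chapter, ending_chapter + 1)]

rows = []
# enumerate() yields the index and the list entry together.
for x, link in enumerate(link_list):
    index_name = str(x) + ".html"  # what the tutorial loop uses
    chapter_name = str(starting_chapter + x) + ".html"  # hypothetical alternative
    rows.append((link, index_name, chapter_name))

for row in rows:
    print(row)
```

Since the HTML files are only temporary, either naming scheme works. Index-based names have the small advantage of always starting at 0, whichever chapters were requested.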
If you save the file now and run it, you'll see that inside the folder of your .py files, there are some HTML files. These contain the HTML code of the websites we downloaded.
In the next part, we will filter out everything we don't need so only the chapter content is left. We'll also take a look at how a .epub file is created and structured.
Posted on Utopian.io - Rewarding Open Source Contributors