Learn Python Series (#33) - Data Science Part 4 - Pandas

scipio (65)in #utopian-io • 6 years ago (edited)

Learn Python Series (#33) - Data Science Part 4 - Pandas

Repository

What will I learn?

You will learn that pandas provides you with powerful tools you can use to filter data with;
individual and multiple conditions can be combined;
inline conditions can be used;
Boolean variables can be defined, useful for readability and re-use;
"auto-magical" conditional expression strings can be passed as arguments to the .query() method, but beware for potential security vulnerabilities in case you're accepting external (user / bot) input!

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.7) distribution, such as (for example) the Anaconda Distribution;
The ambition to learn Python programming.

Difficulty

Beginner, intermediate

Additional sample code files

The full - and working! - iPython tutorial sample code file is included for you to download and run for yourself right here:
https://github.com/realScipio/learn-python-series/blob/master/lps-033/learn-python-series-033-data-science-pt4-pandas.ipynb

The example CSV file that was used in the episodes #31 and #32 is copied to the lps-033 folder as well:
https://github.com/realScipio/learn-python-series/blob/master/lps-033/btcusdt_20190602_20190604_1min_hloc.csv

GitHub Account

https://github.com/realScipio

Learn Python Series (#33) - Data Science Part 4 - Pandas

Re-loading the actual BTCUSDT financial data using `pandas`

First, let's again read and open the file btcusdt_20190602_20190604_1min_hloc.csv found here on my GitHub account, (after having saved the file to your current working directory, from which you're also opening it using .read_csv()):

import pandas as pd
df = pd.read_csv('btcusdt_20190602_20190604_1min_hloc.csv', 
                 parse_dates=['datetime'], index_col='datetime')
df.head()

	open	high	low	close	volume
datetime
2019-06-02 00:00:00+00:00	8545.10	8548.55	8535.98	8537.67	17.349543
2019-06-02 00:01:00+00:00	8537.53	8543.49	8524.00	8534.66	31.599922
2019-06-02 00:02:00+00:00	8533.64	8540.13	8529.98	8534.97	7.011458
2019-06-02 00:03:00+00:00	8534.97	8551.76	8534.00	8551.76	5.992965
2019-06-02 00:04:00+00:00	8551.76	8554.76	8544.62	8549.30	15.771411

Executing `pandas` conditional filters

If you've been following along about pandas in this Data Science sub-series, hopefully by now you've realised that pandas provides you with quite some powerful built-in methods to analyse, enrich, clean-up, and (re-)model data sets with. Like Excel / OpenOffice spreadsheets and like database management systems, pandas is able to "filter" data.

There exist multiple techniques to execute conditional filters.

Inline conditionals

Using the same CSV file, containing 4320 1-minute ticks, fetched from Binance, on their BTCUSDT trading pair, on dates June 2, 2019 to June 4, 2019, you've already analysed (via the df.describe() statistical overview method) that the price of Bitcoin in the given interval was trading between 7481.02 and 8814.78.

df.describe()

	open	high	low	close	volume
count	4320.000000	4320.000000	4320.000000	4320.000000	4320.000000
mean	8354.033475	8359.921905	8347.543243	8353.818926	34.183344
std	358.395024	357.338897	359.911089	358.538551	54.520356
min	7490.200000	7533.430000	7481.020000	7494.110000	1.351415
25%	7985.045000	7990.270000	7979.205000	7984.997500	11.114809
50%	8519.605000	8524.985000	8513.490000	8518.845000	19.566122
75%	8661.080000	8666.992500	8656.007500	8661.200000	35.273851
max	8808.820000	8814.780000	8805.850000	8809.910000	949.563225

In order to derive how many 1-minute candles the data set contains in which the opening price (the price data in the "open" column) was higher than 8800, the condition itself is - of course:

df['open'] > 8800

To use this condition inline, pass that exact condition as a squared bracket "slicing argument" to the DataFrame df, like so:

df_over_8000 = df[df['open'] > 8800]
df_over_8000

	open	high	low	close	volume
datetime
2019-06-02 12:04:00+00:00	8801.84	8804.14	8774.17	8784.35	98.534386
2019-06-02 12:45:00+00:00	8804.44	8805.51	8799.00	8799.00	45.084453
2019-06-02 12:47:00+00:00	8807.61	8814.78	8805.85	8808.72	33.917674
2019-06-02 12:48:00+00:00	8808.82	8811.78	8797.34	8797.35	68.527716

I hadn't explicitly explained using the .count() method before, for example in the "basic pandas statistic operations"-section in the 2nd Data Science sub-series episode (found here), but using the .describe() stats overview method, which was explained before, it was mentioned and used already.

Regardless, if you want to count (to use for further processing or simply to return its value) the number of instances on which the condition is True, simply chain the .count() method, like so:

df[df['open'] > 8800]['open'].count()

And indeed, the df_over_8000 result table printed 4 (in total) 1-minute data ticks correctly, containing opening prices higher than 8800.

Filtering using multiple inline conditions

In order to filter (= keep) using two or more conditions, use multiple conditions as "filtering slice" via the & (and) and | (or) bitwise operators.

Say, for example, you want to filter all 1-minute ticks where opening price was above 8800 and trading volume was above 50 Bitcoins, execute:

df_over_8000_high_volume = df[ (df['open'] > 8800) & (df['volume'] > 50) ]
df_over_8000_high_volume

	open	high	low	close	volume
datetime
2019-06-02 12:04:00+00:00	8801.84	8804.14	8774.17	8784.35	98.534386
2019-06-02 12:48:00+00:00	8808.82	8811.78	8797.34	8797.35	68.527716

By "eye-balling" the first (single-condition) example output, you can see the results are correct.

If we would replace & for |, we're filtering something completely different, being:

let's keep all rows in which either opening price is above 8800,
or
volume is above 50:

df_over_8000_OR_high_volume = df[ (df['open'] > 8800) | (df['volume'] > 50) ]
df_over_8000_OR_high_volume.head(10)

	open	high	low	close	volume
datetime
2019-06-02 01:22:00+00:00	8586.33	8595.12	8583.23	8594.99	51.051786
2019-06-02 03:02:00+00:00	8535.29	8550.20	8529.55	8545.56	56.543773
2019-06-02 07:33:00+00:00	8588.05	8599.85	8586.00	8599.36	61.909827
2019-06-02 07:34:00+00:00	8599.56	8615.00	8598.09	8614.65	197.822432
2019-06-02 07:35:00+00:00	8611.87	8652.75	8611.05	8648.88	292.018398
2019-06-02 07:36:00+00:00	8648.90	8680.00	8648.87	8670.11	239.552508
2019-06-02 07:37:00+00:00	8673.25	8673.25	8653.92	8661.19	96.012789
2019-06-02 07:38:00+00:00	8661.68	8676.26	8661.68	8674.99	54.226364
2019-06-02 07:39:00+00:00	8674.99	8720.00	8671.88	8715.01	188.849318
2019-06-02 07:40:00+00:00	8710.01	8714.87	8685.24	8703.31	118.652782

df_over_8000_OR_high_volume['open'].count()

Remembering that df_over_8000_high_volume returned only 2 rows in which both conditions were met, if we substract those 2 from the individual conditionals combined, indeed we get 671 as a result:

price_over_8800 = df[(df['open'] > 8800)]['open'].count()
volume_over_50 = df[df['volume'] > 50]['open'].count()
num_df_over_8000_high_volume = df_over_8000_high_volume['open'].count()

print(f"price_over_8800: {price_over_8800}")
print(f"volume_over_50: {volume_over_50}")
print(f"num_df_over_8000_high_volume: {num_df_over_8000_high_volume}")

price_over_8800: 4
volume_over_50: 669
num_df_over_8000_high_volume: 2

In 4 instances, price was over 8800;
In 669 instances, volume was over 50;
In 2 instances, price was over 8800 and volume over 50

Ergo: (669 + 4) - 2 = 671

Filtering passing Boolean variables

As you may have noticed, filtering by multipal inline conditions very quickly leads to long lines of code. For readability matters, it's also possible to assign Boolean variables, and pass those instead. The following syntax (same example) also works:

price_over_8800 = df['open'] > 8800
volume_over_50 = df['volume'] > 50
df_over_8000_high_volume = df[price_over_8800 & volume_over_50]
df_over_8000_high_volume

	open	high	low	close	volume
datetime
2019-06-02 12:04:00+00:00	8801.84	8804.14	8774.17	8784.35	98.534386
2019-06-02 12:48:00+00:00	8808.82	8811.78	8797.34	8797.35	68.527716

--- this looks much better!

Filtering using the `.query()` method

A third way to filter DataFrame data, is by using the .query() method. .query() expects to receive a string as its expr= argument, using Boolean (instead of bitwise) operators, ergo: and / or.

Also, the column names are referenced inside the conditional expression string. Like so (again: same example):

query_expression = 'open > 8800 and volume > 50'
df.query(query_expression)

	open	high	low	close	volume
datetime
2019-06-02 12:04:00+00:00	8801.84	8804.14	8774.17	8784.35	98.534386
2019-06-02 12:48:00+00:00	8808.82	8811.78	8797.34	8797.35	68.527716

Nota bene: Please stand still for a second and realise that, in order to make the last mentioned .query() method work, only a "humanized" looking string needed to be passed-in as an argument. It seems to "just work auto-magically"...

Under the hood, .query() utilises the (not yet explained) .eval() method of pandas, which is - in general - able to evaluate strings to derive column-wise vectorised operations from. How magical and convenient that may be (it is very cool actually!) it also allows .eval() to execute arbitrary code. So be careful when using it on an interface where others (users, bots) are allowed to pass their (potentially dangerous) input strings as .eval() arguments!

What did we learn, hopefully?

Hopefully you've learned that pandas provides you with techniques for filtering data based on one or more conditional filters. When using bitwise operators & / |, or boolean operators and / or (using .query()), keep in mind you're writing expressions about which rows you want to keep (not drop).

Thank you for your time!

Curriculum (of the `Learn Python Series`):

#tutorials #steemdev #open-source #palnet

6 years ago in #utopian-io by scipio (65)

$13.85

Sort:

Trending

[-]

portugalcoin (72) 6 years ago

Thank you for your contribution @scipio.
After reviewing your tutorial we suggest the following points listed below:

In your tutorial always use the third person, so that your text becomes very professional.
It was interesting to have some images in the next tutorial. The tutorial is a bit long and quite massive.
Thank you for following some suggestions we gave in your previous tutorial.

Looking forward to your upcoming tutorials.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.

Need help? Chat with us on Discord.

[utopian-moderator]

$8.31

[-]

utopian-io (71) 6 years ago

Thank you for your review, @portugalcoin! Keep up the good work!

$0.00

[-]

artturtle (63) 6 years ago

Thank you scipio! You've just received an upvote of 6% by artturtle!

Learn how I will upvote each and every one of your posts

Please come visit me to see my daily report detailing my current upvote power and how much I'm currently upvoting.

$0.06

1 vote

[-]

steem-ua (64) 6 years ago

Hi @scipio!

Your post was upvoted by @steem-ua, new Steem dApp, using UserAuthority for algorithmic post curation!
Your UA account score is currently 7.027 which ranks you at #73 across all Steem accounts.
Your rank has dropped 1 places in the last three days (old rank 72).

In our last Algorithmic Curation Round, consisting of 200 contributions, your post is ranked at #39.

Evaluation of your UA score:

Your follower network is great!
The readers appreciate your great work!
Good user engagement!

Feel free to join our @steem-ua Discord server

$0.06

1 vote

[-]

steem-plus (73) 6 years ago

Hi, @scipio!

You just got a 0.05% upvote from SteemPlus!
To get higher upvotes, earn more SteemPlus Points (SPP). On your Steemit wallet, check your SPP balance and click on "How to earn SPP?" to find out all the ways to earn.
If you're not using SteemPlus yet, please check our last posts in here to see the many ways in which SteemPlus can improve your Steem experience on Steemit and Busy.

$0.00

[-]

utopian-io (71) 6 years ago

Hey, @scipio!

Thanks for contributing on Utopian.
We’re already looking forward to your next contribution!

Get higher incentives and support Utopian.io!
Simply set @utopian.pay as a 5% (or higher) payout beneficiary on your contribution post (via SteemPlus or Steeditor).

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

$0.00

Learn Python Series (#33) - Data Science Part 4 - Pandas

Learn Python Series (#33) - Data Science Part 4 - Pandas

Repository

What will I learn?

Requirements

Difficulty

Additional sample code files

GitHub Account

Learn Python Series (#33) - Data Science Part 4 - Pandas

Re-loading the actual BTCUSDT financial data using pandas

Executing pandas conditional filters

Inline conditionals

Filtering using multiple inline conditions

Filtering passing Boolean variables

Filtering using the .query() method

What did we learn, hopefully?

Thank you for your time!

Curriculum (of the Learn Python Series):

Thank you scipio! You've just received an upvote of 6% by artturtle!

Learn how I will upvote each and every one of your posts

Hi @scipio!

Evaluation of your UA score:

Re-loading the actual BTCUSDT financial data using `pandas`

Executing `pandas` conditional filters

Filtering using the `.query()` method

Curriculum (of the `Learn Python Series`):