SQL 101 with duckdb and jupysql

python
sql
Author

KO

Published

August 17, 2023

SQL 101 with duckdb and jupysql

This notebook sql and matplotlibs its way to:

Jupyter + duckdb setup for sql

Before the sql even starts, I want to use jupyter notebooks + sql. I’m using duckdb inside jupyter to look at a few sql datasets.

First up, make a new env and install the necessary libraries if necessary by:

mamba create -n sql101 python=3.11 jupyterlab duckdb duckdb-engine jupysql matplotlib openpyxl plotly
mamba active sql101

Setting up duckdb for jupyter

Once all the packages have been installed, setup jupyter to use duckdb. You don’t need to use all the three config options below, just noting some useful ones for future reference.

the %config SqlMagic.autopandas in particular is useful - the sql query is returned as a pandas dataframe. This is useful as even though right now I have a small dataset, on a real project duckdb can talk to remote databases, or query parquet files on the web, do a query and only pull down the subset of data selected in the query.

After that, the pandas dataframe is ideal for plotting and other pythonic stuff.

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

%config SqlMagic.displaycon = False  # hides con info
%config SqlMagic.autopandas = True   # output as df
#%config SqlMagic.feedback = False   # 

%sql duckdb://

Data - Big Mac index

I’m using the Economist’s big mac gdp source data. Pandas can directly download the csv file into a dataframe which duckdb can read, but trying to stick with most sql here, so I’m downloading the csv file to disk and reading it with duckdb directly.

Why the big mac index? To calculate the big mac index we need to use correlated subqueries, which is a pretty advanced topic!

The final output figure Figure 1 is somewhere below, this notebook works through getting there slowly:

url = "https://github.com/TheEconomist/big-mac-data/raw/master/source-data/big-mac-source-data-v2.csv"
url = "https://github.com/TheEconomist/big-mac-data/raw/january-2023-update/source-data/big-mac-source-data-v2.csv"

fname = "big-mac-source-data-v2.csv"

with open(fname, "wb") as file:
    file.write(requests.get(url).content)

When using %%sql cells in jupyter, put python variables in brackets like so: {{var_name}}.

The read_csv_auto should auto-magically read and convert the csv to a sql table:

%%sql
-- make a table bigmac from the csv file
create table if not exists bigmac as select * from read_csv_auto('{{fname}}');

select * from bigmac limit 3; -- comments can go here too!
name iso_a3 currency_code local_price dollar_ex GDP_dollar GDP_local date dollar_price USD GDP_bigmac
0 Argentina ARG ARS 2.50 1.00 8709.072 8709.072 2000-04-01 2.500000 0.116071 7803.328512
1 Australia AUS AUD 2.59 1.68 21746.809 33698.764 2000-04-01 1.541667 -0.311756 29144.876973
2 Brazil BRA BRL 2.95 1.79 3501.438 6351.375 2000-04-01 1.648045 -0.264266 4822.738983

So now we have a table inside a sql database.

Note

sql convention is to use a lot of CAPITALS, but for fast typing and a lack of an sql formatter, I’m going lowercase. Ideally your sql writing thingamjig should have a formatter which does that for you.

EDA and data cleanup

Data always has some issues, so taking a look:

Metadata about the table

A sql database can have many tables, so its useful to take a look at whats there:

%sqlcmd tables
Name
bigmac

This should describe the table:

%sql describe bigmac;
Success

The describe bigmac code should spit out table info, seems to be some kind of bug, but moving on, we can get the gist using the information schema:

%%sql 
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'bigmac';
column_name data_type
0 name VARCHAR
1 iso_a3 VARCHAR
2 currency_code VARCHAR
3 local_price DOUBLE
4 dollar_ex DOUBLE
5 GDP_dollar DOUBLE
6 GDP_local DOUBLE
7 date DATE

The above is not that useful, a more informative look which counts the values is:

%sqlcmd profile --table 'bigmac'
name iso_a3 currency_code local_price dollar_ex GDP_dollar GDP_local date dollar_price USD GDP_bigmac
count 1918 1918 1918 1918 1918 1918 1918 1918 1918 1918 1918
unique 73 72 56 655 1473 1141 1141 37 1827 1835 1790
top Argentina ARG EUR nan nan nan nan 2018-07-01 nan nan nan
freq 37 37 350 nan nan nan nan 72 nan nan nan
mean nan nan nan 7559.2864 2870.1936 25020.0618 2615423.9573 nan 3.4740 -0.1340 26351.7044
std nan nan nan 154285.8636 58921.0338 21466.5984 9471890.5225 nan 1.2289 0.3013 19541.3083
min nan nan nan 1.05 0.3008 689.826 3004.341 nan 0.64 -0.7409 0.0022
25% nan nan nan 4.4200 1.0000 6598.8800 33698.7640 nan 2.5118 -0.3530 8516.5585
50% nan nan nan 14.7500 5.3375 18167.3440 70506.5530 nan 3.4129 -0.1755 23327.7564
75% nan nan nan 85.0000 32.1055 41469.7730 358689.5130 nan 4.3132 0.0220 40981.9351
max nan nan nan 4000000.0 1600500.0 102576.68 85420189.717 nan 8.3117 1.5347 98893.859

Data cleanup

So we have 3 rows without a GDP_dollar, and at least one local_price and dollar_ex of 0, so dropping those rows. We also need GDP_local to be able to adjust the bigmac index, so dropping any nulls in that row too.

%%sql 
delete from bigmac where 
GDP_dollar is null or GDP_local is null or local_price<=0 or dollar_ex<=0;
Success
rows = %sql select count(*) as N from bigmac;
f"The data set has a total of {rows.N.loc[0]} rows."
'The data set has a total of 1918 rows.'

It’s a pretty clean and tidy dataset, we did loose a few rows, which in a real world case might bear more investigation, but moving on…

%sql select count(distinct name) as Countries, min(date), max(date) from bigmac;
Countries min(date) max(date)
0 73 2000-04-01 2022-07-01

We have data for 73 countries from April 2000 to July 2022.

Calculations in SQL

Adding some columns by calculating new ones in sql. This is where the sql starts.

Big mac price in USD

We want to get the US dollar price for a big mac in every country, which is easy as our data contains the local_price and dollar exchange rate. So we add a new column: dollar_price = local_price / dollar_ex.

In sql you can’t just add a column, you first have to add it with a value type. In pandas you can do this in one step: df["dollar_price"] = df.local_price / df.dollar_ex.

So here we add a new col dollar_price and calc its value:

%%sql
alter table bigmac 
    add column if not exists dollar_price DOUBLE;
update bigmac 
    set dollar_price = local_price / dollar_ex;

select * from bigmac limit 3;
name iso_a3 currency_code local_price dollar_ex GDP_dollar GDP_local date dollar_price USD GDP_bigmac
0 Argentina ARG ARS 2.50 1.00 8709.072 8709.072 2000-04-01 2.500000 0.116071 7803.328512
1 Australia AUS AUD 2.59 1.68 21746.809 33698.764 2000-04-01 1.541667 -0.311756 29144.876973
2 Brazil BRA BRL 2.95 1.79 3501.438 6351.375 2000-04-01 1.648045 -0.264266 4822.738983

The line if not exists is optional, but useful in this context as it prevents errors if I rerun the notebook.

Calculating the big mac index

Now we want to account for purchasing power parity by divinding the dollar price with the base currencies price. The Economist uses five base currencies: ('USD', 'EUR', 'GBP', 'JPY', 'CNY')

For simplicity’s sake, I’ll stick with just USD. This is a easy calc to do in python, in SQL its a bit messy…, so I am using Correlated Subqueries in SQL to do this.

The dollar_price on each row is divided by the dollar_price in USD. Since we have multiple dates, the query below matches on both the date and the country code.

The inner query returns the US dollar_price for each date, so every row in the table gets divided by the right dates USD price.

Note

This only works because for each unique date, there is only one row for USA.

Finally, we minus the number by -1, as we divide the US price by its own, so its always at 1. By subtracting -1, we set that to zero, which makes it wasy to see how the other countries are over or under that.

%%sql
ALTER TABLE bigmac ADD COLUMN IF NOT EXISTS USD DOUBLE;

update bigmac as b1
    set USD = dollar_price / 
            (select b2.dollar_price from bigmac as b2
            where b2.date = b1.date
            and b2.iso_a3 = 'USA') - 1;

select * from bigmac order by date desc, name desc limit 6;
name iso_a3 currency_code local_price dollar_ex GDP_dollar GDP_local date dollar_price USD GDP_bigmac
0 Vietnam VNM VND 69000.00 23417.00000 3724.543 8.542019e+07 2022-07-01 2.946577 -0.427849 6375.564885
1 Uruguay URY UYU 255.00 41.91000 16756.344 7.291942e+05 2022-07-01 6.084467 0.181450 14726.863618
2 United States USA USD 5.15 1.00000 69231.400 6.923140e+04 2022-07-01 5.150000 0.000000 69231.400000
3 United Arab Emirates ARE AED 18.00 3.67305 42883.686 1.574903e+05 2022-07-01 4.900559 -0.048435 45059.735594
4 Turkey TUR TRY 47.00 17.56500 9527.683 8.448134e+04 2022-07-01 2.675776 -0.480432 9256.997784
5 Thailand THA THB 128.00 36.61250 7336.086 2.313028e+05 2022-07-01 3.496074 -0.321151 9306.323554

I did a few random checks and looks like the formula did use the right USD price to calculate the USD offset.

Adjusted GDP based on big mac index

Big Mac adjusted per capita GDP is the GDP in local currency divided by the exchange rate as determined by big macs (price in local currency divivded by price in US).

The formula is: GDP_Local / (local_price / dollar_price)

%%sql
ALTER TABLE bigmac ADD COLUMN IF NOT EXISTS GDP_bigmac DOUBLE;

update bigmac as b1
    set GDP_bigmac = GDP_local / 
            (local_price / (select b2.dollar_price from bigmac as b2
            where b2.date = b1.date
            and b2.iso_a3 = 'USA'));

select * from bigmac order by date desc, name desc limit 6;
name iso_a3 currency_code local_price dollar_ex GDP_dollar GDP_local date dollar_price USD GDP_bigmac
0 Vietnam VNM VND 69000.00 23417.00000 3724.543 8.542019e+07 2022-07-01 2.946577 -0.427849 6375.564885
1 Uruguay URY UYU 255.00 41.91000 16756.344 7.291942e+05 2022-07-01 6.084467 0.181450 14726.863618
2 United States USA USD 5.15 1.00000 69231.400 6.923140e+04 2022-07-01 5.150000 0.000000 69231.400000
3 United Arab Emirates ARE AED 18.00 3.67305 42883.686 1.574903e+05 2022-07-01 4.900559 -0.048435 45059.735594
4 Turkey TUR TRY 47.00 17.56500 9527.683 8.448134e+04 2022-07-01 2.675776 -0.480432 9256.997784
5 Thailand THA THB 128.00 36.61250 7336.086 2.313028e+05 2022-07-01 3.496074 -0.321151 9306.323554

And presto, we have a gdp per big mac, which for most countries is significantly different from each other. A couple of plots to eyeball this:

Code
df = %sql select * from bigmac;

fig, ax = plt.subplots(figsize=(8, 4))
ax.spines[["top", "right"]].set_visible(False)
ax.set_title("GDP Bigmac vs dollar", loc="left", weight="bold")

for name in ["Pakistan", "India"]:
    d = df.query("name == @name")
    ax.plot(d.date, d.GDP_bigmac, alpha=0.8, label=f"{name}")
    ax.fill_between(d.date, d.GDP_bigmac, d.GDP_dollar, alpha=0.12)
    #ax.plot(d.date, d.GDP_dollar, alpha=0.6, ls="--")

ax.legend();

Code
country = "Pakistan"
df = %sql select * from bigmac where name in ('{{country}}')
fig = px.line(df, x="date", y=["GDP_dollar", "GDP_bigmac"], 
              title=f"{country}: GDP local vs bigmac",
              markers=True)
fig.show()

This is interesting, you can see the indian dollar even though its close to the Pakistan one in GDP_dollar terms, the big mac index GDP says it buys a lot more.

TODO: Adjusted big mac index

This uses the big mac adjusted per capita GDP price calculated above to get a adjusted dollar price for a bigmac. This looks very similar to first big mac index so left this calc for a future date.

See Calculating the adjusted index for details.

Moar data - adding in country data

There are a lot of countries in the index, which will make the graph a bit messy, so to add some groups to the counties (and do more sql!) I’m going to use the world banks income dataset, primary to add regions and income groups.

df_income = pd.read_excel(
    "https://datacatalogfiles.worldbank.org/ddh-published/0037712/DR0090755/CLASS.xlsx"
)

df_income = (
    df_income.dropna(subset=["Income group", "Economy"])
    .drop(columns="Lending category")
    .rename(columns={"Code": "iso_a3", "Income group": "income_group"})
)

# adding in the missing eurozone
euro_row = ["Eurozone", "EUZ", "Europe", "High income"]
df_income.loc[len(df_income)] = euro_row

df_income.tail(3)
Economy iso_a3 Region income_group
215 South Africa ZAF Sub-Saharan Africa Upper middle income
216 Zambia ZMB Sub-Saharan Africa Lower middle income
217 Eurozone EUZ Europe High income

No need to do this step, but for more sql goodness, lets add this as a table in our database so we can practice joining two tables:

%%sql
create table income as select * from df_income;
select * from income limit 3;
RuntimeError: (duckdb.CatalogException) Catalog Error: Table with name "income" already exists!
[SQL: create table income as select * from df_income;]
(Background on this error at: https://sqlalche.me/e/20/f405)
If you need help solving this issue, send us a message: https://ploomber.io/community

Joining the two tables

Here we join these two tables into a dataframe for plotting.

I am using a left join as I want to keep all data in the bigmac table, and just add country info.

Warning

Don’t write sql like below, use capitals and linebreaks!

df_all = %sql select * from bigmac left join income on bigmac.iso_a3 = income.iso_a3 order by bigmac.date asc;
df_all.head(2)
name iso_a3 currency_code local_price dollar_ex GDP_dollar GDP_local date dollar_price USD GDP_bigmac Economy iso_a3_2 Region income_group
0 Argentina ARG ARS 2.50 1.00 8709.072 8709.072 2000-04-01 2.500000 0.116071 7803.328512 Argentina ARG Latin America & Caribbean Upper middle income
1 Australia AUS AUD 2.59 1.68 21746.809 33698.764 2000-04-01 1.541667 -0.311756 29144.876973 Australia AUS East Asia & Pacific High income

Everything looks good here, we now have a dataframe which combines the two tables and is ready to plot.

Save data to disk

For using in a future something.

df_all.to_parquet("bigmac_index_data_ko.parquet")

Big Mac Index chart using Matplotlib

This is going to be ugly, but its good matplotlib practice. First up, for the economist plot we just want the latest date:

df = df_all.query("date == date.max()").sort_values("USD")
df.head(2)
name iso_a3 currency_code local_price dollar_ex GDP_dollar GDP_local date dollar_price USD GDP_bigmac Economy iso_a3_2 Region income_group
1899 Romania ROU RON 11.0 4.82175 14667.089 6.102120e+04 2022-07-01 2.281329 -0.557023 28569.017768 Romania ROU Europe & Central Asia High income
1876 Indonesia IDN IDR 35000.0 14977.50000 4356.560 6.233566e+07 2022-07-01 2.336839 -0.546245 9172.246719 Indonesia IDN East Asia & Pacific Upper middle income
Code
fig, ax = plt.subplots(figsize=(5, 10))
ax.spines[["top", "right", "left"]].set_visible(False)

ax.set_title(
    "Big Mac Index", loc="left", weight="bold", ha="left", fontsize=15, x=-0.42
)

ax.text(
    x=-0.2,
    y=0.915,
    s="Unadjusted index for 2022",
    transform=fig.transFigure,
    ha="left",
    fontsize=9,
    alpha=0.8,
)


# plotting the actual data

# assigning each income group a color
color_list = list(mcolors.TABLEAU_COLORS.values())
colors = {country: color_list[i] for i, country in enumerate(df.income_group.unique())}

# the actual plot
ax.barh(
    y=df.name,
    width=df.USD,
    alpha=0.85,
    height=0.78,
    color=[colors[income] for income in df.income_group],
    zorder=2,
)

# x axis things
ax.xaxis.tick_top()
ax.set_xlabel(
    "                 <-- Undervalued              Overvalued --> ",
    labelpad=10,
    fontsize=11,
)
ax.xaxis.set_label_position("top")

# fix and label the y axis
# ax.set_yticklabels(df.name, ha="left")
ax.yaxis.set_tick_params(pad=2, labelsize=9, bottom=False)
ax.set_ylim(-1, df.shape[0])


# legend for income groups
handles = [mpatches.Patch(color=colors[i]) for i in colors]
labels = [f"{i}" for i in colors]
ax.legend(handles, labels, fontsize=8)
ax.grid(which="major", axis="x", color="#758D99", alpha=0.5, zorder=1)

# Set source text
ax.text(
    x=-0.2,
    y=0.08,
    s="""Source: "Big Mac Index" via economist.com""",
    transform=fig.transFigure,
    ha="left",
    fontsize=9,
    alpha=0.7,
);

Figure 1: Big Mac Index

phew! That is a long image, with too many countries, and its a bit ugly, so if this was going to be used somewhere, I’d filter the countries and produce a more sensible sized graph.

Big Mac Index using Plotly

I should have probably gone with plotly first for this, but here goes:

fig = px.bar(
    df, x="USD", y="name", width=600, height=800, text="name", color="income_group"
)
fig.update_layout(title_text="Big Mac Index", yaxis_categoryorder="total ascending")
fig.show()

Big mac index with trendlines

This is a bit messy again… but here goes some plotly practice:

Code
names = df_all.name.unique()
show_countries = ["Pakistan", "India"]

fig = px.scatter(df_all, x="date", y="USD", color="name", opacity=0.8,
                 trendline="lowess")

# add a text label to the end of each scatter plot
max_date = df_all.date.max()
df = df_all.query("date == @max_date")

for name in df.name:
    fig.add_trace(go.Scatter(opacity=0.6,
        x=df.query("name == @name").date,
        y=df.query("name == @name").USD + 0.15,
        mode="text",
        legendgroup=name,showlegend=False,
        name=name,
        text=name,
        textposition="top center",
    ))

# turn off plots
fig.for_each_trace(lambda trace: trace.update(visible='legendonly')
                   if trace.name not in show_countries else ())

#fig.for_each_trace(lambda trace: trace.update(opacity=0.09)
#                   if trace.name not in show_countries else ())

fig.add_hline(y=0, opacity=0.5, visible=True, line_width=1)

fig.update_layout(title="Big Mac Index with country selector", template="simple_white")
fig.update_yaxes(tickformat=".0%", dtick=0.2, range=[-0.7, 0.7], 
                 showspikes=True, spikethickness=0.5)
fig.update_xaxes(showspikes=True, spikethickness=0.5)


fig.add_trace(go.Scatter(
    x=[2022],
    y=[0.01],
    mode="text",
    name="US dollar",
    text=["US dollar"],
    textposition="top center"
))

fig.show()
Code
names = df_all.name.unique()
show_countries = ["Pakistan"]

colors = ["red" if usd < 0 else "#1f77b4" for usd in df_all.USD]
fig = px.scatter(df_all, x="date", y="USD", opacity=0.05).update_traces(
    marker=dict(color=colors))

# add a text label to the end of each scatter plot
max_date = df_all.date.max()
df = df_all.query("date == @max_date")

for name in show_countries:
    dff = df.query("name == @name")

    fig.add_traces(
        px.line(df_all.query("name==@name"), 
                   x="date", y="USD").data[:]
    )
    
    fig.add_trace(go.Scatter(opacity=0.6,
        x=dff.date,
        y=dff.USD + 0.15,
        mode="text",
        legendgroup=name,showlegend=False,
        name=name,
        text=name,
        textposition="top center",
    ))


# turn off plots
#fig.for_each_trace(lambda trace: trace.update(visible='legendonly')
#                   if trace.name not in show_countries else ())

#fig.for_each_trace(lambda trace: trace.update(opacity=0.09)
#                   if trace.name not in show_countries else ())

fig.add_hline(y=0, opacity=0.5, visible=True, line_width=0.8)

fig.update_layout(title="Big Mac Index highlighting a country", template="simple_white")
fig.update_yaxes(tickformat=".0%", dtick=0.2, range=[-0.7, 0.7], 
                 showspikes=True, spikethickness=0.5)
fig.update_xaxes(showspikes=True, spikethickness=0.5)


fig.add_trace(go.Scatter(
    x=[2022],
    y=[0.01],
    mode="text",
    name="US dollar",
    text=["US dollar"],
    textposition="top center"
))

fig.show()

The above graph has some issues, to be fixed later. Looks like this would be simpler in Altair instead of plotly.

from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2)

fig = px.choropleth(df, color="USD", locations="iso_a3", 
                    title="Big Mac Index", hover_name="name")
fig.update_layout(height=600, width=800)
fig.show()

Altair

Altair seems useful for this too… trying it out

Code
import altair as alt

highlight = alt.selection_point(on='mouseover', fields=['name'], nearest=True)

base = alt.Chart(df_all).encode(
    x='date:T',
    y='USD:Q',
    color='name:N',
    tooltip=['name', 'date','USD']
)

points = base.mark_circle().encode(
    opacity=alt.value(0.01)
).add_params(
    highlight
).properties(
    width=600
)

lines = base.mark_line().encode(
    size=alt.condition(~highlight, alt.value(0.2), alt.value(3))
)

points + lines
name = "Pakistan"
df = df_all.query("name == @name")

chart = alt.Chart(df).mark_area(
    opacity=0.7, color="red", line=True).encode(
    x="date:T",
    y=alt.Y("USD:Q")
)

chart

the end

Can play around a bit more with this data… but heaps enough for now. The economist has a great dashboard to this very simple dataset, so a good future dashboarding exercise is to make something which lookls like that.