SQL Databases for Students and Educators

Pachá

2021-02-02

Update 2022-09-30: The SQL Server educational license expired. From now on, only PostgreSQL and MariaDB databases will be available.

About

Publicly accessible databases often impose query limits or require registration. Even though I maintain public and limit-free APIs, I never wanted to host a public database because I thought connection strings would be a problem for the user.

I’ve decided to host different light/medium-size databases using PostgreSQL and MariaDB (a modern fork of MySQL) backends. I tend to prefer PostgreSQL because it is easier to configure on both the client and server side, without strange tricks (see the Problems with MySQL/Maria and SQL Server section).

Why these database backends? There are a ton of small edge cases when moving between DB backends, so testing against live databases is quite valuable. With this resource you can benchmark speed, compression, and DDL types.

Please send me a message if you need the connection strings for your lectures or workshops. My Mastodon username is @pacha.

You can contribute to cover the hosting costs here:

Available Databases

Open Trade Statistics

Backends

  • PostgreSQL
  • MariaDB

It’s the database that powers Open Trade Statistics and contains international trade data for commodities since 2002.

This database can be helpful for evaluating the impact of free trade agreements and other policy-relevant modelling. The database is a modified version of the United Nations Commodity Trade Statistics Database (UN COMTRADE) and is targeted at Latin American universities, which often have limited or no access to the original resource.

The full description for each table is:

| table | description | source |
|---|---|---|
| commodities | Commodities metadata (HS codes, 6 digits long) | UN Comtrade (with modifications) |
| commodities_short | Commodities metadata (HS codes, 4 digits long) | UN Comtrade (with modifications) |
| countries | Countries metadata | UN Comtrade (with modifications) |
| distances | Distance between countries, alongside contiguity, common language, and whether the countries were once the same country | CEPII (with modifications) |
| partners | Partners for a given year | UN Comtrade (with modifications) |
| reporters | Reporters for a given year | UN Comtrade (with modifications) |
| sections | Sections metadata (HS codes) | UN Comtrade (with modifications) |
| sections_colors | Colors for sections (useful to visualize data) | Open Trade Statistics |
| rtas | Regional Trade Agreements per pair of countries and year | Design of Trade Agreements (DESTA) Database |
| tariffs | Most Favoured Nation tariffs (Year, Reporter and Commodity Code) | World Integrated Trade Solution |
| years | Minimum and maximum years with available data | Open Trade Statistics |
| yc | Commodity trade at aggregated level (Year and Commodity), raw trade flows (i.e. with transportation costs) | Open Trade Statistics |
| yr | Reporter trade at aggregated level (Year and Reporter), raw trade flows (i.e. with transportation costs) | Open Trade Statistics |
| yrc | Reporter trade at commodity level (Year, Reporter and Commodity Code), raw trade flows (i.e. with transportation costs) | Open Trade Statistics |
| yrp | Reporter-Partner trade at aggregated level (Year, Reporter and Partner), raw trade flows (i.e. with transportation costs) | Open Trade Statistics |
| yrpc | Reporter-Partner trade at commodity level (Year, Reporter, Partner and Commodity Code), raw trade flows (i.e. with transportation costs) | Open Trade Statistics |

Access these tables by adapting the following code:

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "tradestatistics",
  host = "databases.pacha.dev"
)

dbListTables(con)
##  [1] "distances"         "vaccine_inputs"    "commodities"      
##  [4] "commodities_short" "gdp"               "rtas"             
##  [7] "sections"          "sections_colors"   "yc"               
## [10] "countries"         "countries_colors"  "yrpc"             
## [13] "tariffs"           "yr"                "yrc"              
## [16] "gdp_deflator"      "yrp"
tbl(con, "yrpc") %>% glimpse()
## Rows: ??
## Columns: 7
## Database: postgres  [teacher@databases.pacha.dev:5432/tradestatistics]
## $ year                <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 20…
## $ reporter_iso        <chr> "abw", "abw", "abw", "abw", "abw", "abw", "abw", "…
## $ partner_iso         <chr> "ant", "arg", "atg", "aus", "aut", "bra", "brb", "…
## $ section_code        <chr> "99", "99", "99", "99", "99", "99", "99", "99", "9…
## $ commodity_code      <chr> "999999", "999999", "999999", "999999", "999999", …
## $ trade_value_usd_imp <dbl> 242559, 0, 346, 659, 2056, 71380, 1112, 95223, 102…
## $ trade_value_usd_exp <dbl> 38553, 2313, 1006, 419, 0, 0, 0, 0, 0, 0, 0, 40553…
dbDisconnect(con)

Intendo

Backends

  • PostgreSQL
  • MariaDB

It’s a dummy database that we originally used for My Organization’s First R Package and contains managerial information for a company called Intendo, which creates mobile games. The tables contain all sorts of information for the year 2015 for their only game, Super Jetroid. See the recent updates to this database at intendo.

This database can be helpful for workshops and showing how to create an R package focused on the particular needs of an organization, including easier data access, shared functions for data transformation and analysis, and a common look and feel for reporting.

Check the structure of the tables in the next preview. The tables ending in “_f” contain deliberate errors.

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "intendo",
  host = "databases.pacha.dev"
)

dbListTables(con)
##  [1] "sj_all_revenue_small_f"   "sj_users_daily_small"    
##  [3] "sj_user_summary_medium"   "sj_all_sessions_large_f" 
##  [5] "sj_users_daily_large_f"   "sj_all_sessions_xlarge_f"
##  [7] "sj_user_summary_xlarge_f" "sj_users_daily_xlarge_f" 
##  [9] "sj_all_revenue_small"     "sj_all_revenue_medium_f" 
## [11] "sj_all_sessions_small"    "sj_users_daily_medium_f" 
## [13] "sj_all_sessions_large"    "sj_users_daily_large"    
## [15] "sj_all_sessions_small_f"  "sj_all_revenue_medium"   
## [17] "sj_users_daily_medium"    "sj_user_summary_large_f" 
## [19] "sj_all_revenue_xlarge_f"  "sj_user_summary_xlarge"  
## [21] "sj_users_daily_xlarge"    "sj_user_summary_small_f" 
## [23] "sj_all_sessions_medium"   "sj_all_revenue_large"    
## [25] "sj_user_summary_small"    "sj_users_daily_small_f"  
## [27] "sj_user_summary_medium_f" "sj_all_sessions_medium_f"
## [29] "sj_all_revenue_large_f"   "sj_user_summary_large"   
## [31] "sj_all_revenue_xlarge"    "sj_all_sessions_xlarge"
tbl(con, "sj_user_summary_small") %>% glimpse()
## Rows: ??
## Columns: 6
## Database: postgres  [teacher@databases.pacha.dev:5432/intendo]
## $ player_id   <chr> "YMKOHGVFZWLJ836", "SBPFOHCVMNQI568", "DLCKJEZRSIGW561", "…
## $ first_login <dttm> 2015-01-01 01:45:33, 2015-01-01 10:13:16, 2015-01-02 14:1…
## $ start_day   <date> 2015-01-01, 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-0…
## $ country     <chr> "United States", "Japan", "United States", "India", "Switz…
## $ acquisition <chr> "organic", "apple", "organic", "organic", "google", "apple…
## $ device_name <chr> "Samsung Galaxy A7", "Sony Experia Z1 Compact", "Sony Expe…
dbDisconnect(con)
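As a classroom exercise, the “_f” tables can be compared against their clean counterparts to surface the deliberate errors. A minimal sketch (the anti_join() approach is my suggestion, not part of the original material, and assumes the same credentials as above):

```r
library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "intendo",
  host = "databases.pacha.dev"
)

# pull both versions of a small table into memory
clean  <- tbl(con, "sj_user_summary_small") %>% collect()
flawed <- tbl(con, "sj_user_summary_small_f") %>% collect()

# rows that appear in the flawed table but not in the clean one,
# joining by all shared columns; these are candidates for the errors
anti_join(flawed, clean)

dbDisconnect(con)
```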

Financial (aka Loan Application)

Backends

  • PostgreSQL
  • MariaDB

This database contains 606 successful and 76 unsuccessful loans, along with their information and transactions.

This database comes from the Relational Dataset Repository, where you can find the database schema and more information. The database’s full name is “PKDD’99 Financial Dataset or Loan Application”, and this particular version corresponds to Financial_ijs by Janez Kranjc.

Access these tables by adapting the following code:

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "financial",
  host = "databases.pacha.dev"
)

dbListTables(con)
## [1] "accounts"  "cards"     "clients"   "disps"     "districts" "loans"    
## [7] "orders"    "tkeys"     "trans"
tbl(con, "clients") %>% glimpse()
## Rows: ??
## Columns: 4
## Database: postgres  [teacher@databases.pacha.dev:5432/financial]
## $ id           <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ birth_number <chr> "706213    ", "450204    ", "406009    ", "561201    ", "…
## $ district_id  <int64> 18, 1, 1, 5, 5, 12, 15, 51, 60, 57, 57, 40, 54, 76, 21,…
## $ tkey_id      <int64> NA, 21, 21, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
dbDisconnect(con)

CENSO 2017

Backends

  • PostgreSQL
  • MariaDB

This database combines two sources: the censo2017 R package (which uses DuckDB as a local backend) and the censo2017 cartographies.

Please read the documentation in the sources mentioned above. Because some tables have a GIS structure, I am bound to use PostgreSQL with PostGIS.
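There is no preview for this database here, but the connection pattern is the same as for the other databases. A minimal sketch, where the database name censo2017 is my assumption from the section title:

```r
library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "censo2017",  # assumed name, check with the maintainer
  host = "databases.pacha.dev"
)

# list the available tables; spatial (PostGIS) tables can then be
# read with sf::st_read(con, "some_table")
dbListTables(con)

dbDisconnect(con)
```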

SQL dumps

All the dumps are available for PostgreSQL to promote its use. I had just too many problems with MariaDB/MySQL configuration and even more with SQL Server (see the Problems with MySQL/Maria and SQL Server section).

Open Trade Statistics

Intendo

Financial

CENSO2017

Using the dumps locally

Restore dumps

PostgreSQL

sudo -i -u postgres
createdb intendo  # create the target database first if it does not exist
gunzip -c intendo-postgres.sql.gz | psql intendo

Create users

You can create a generic user (let’s say student) and grant read-only access.

PostgreSQL

# sudo -i -u postgres 
# psql -d intendo
CREATE ROLE student;
CREATE ROLE teacher;
ALTER ROLE student WITH PASSWORD 'SomePassword';
ALTER ROLE teacher WITH PASSWORD 'SomePassword';
ALTER ROLE student WITH LOGIN;
ALTER ROLE teacher WITH LOGIN;

GRANT CONNECT ON DATABASE intendo TO student;
GRANT USAGE ON SCHEMA public TO student;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO student;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO student;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO student;
REVOKE CREATE ON SCHEMA public FROM public;

GRANT CONNECT ON DATABASE intendo TO teacher;
GRANT USAGE ON SCHEMA public TO teacher;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO teacher;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO teacher;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO teacher;
GRANT CREATE ON SCHEMA public TO teacher;

Allow external connections

For this project I had to allow connections from outside the server.

Postgres (version 14)

  1. Open /etc/postgresql/14/main/postgresql.conf, comment out the line that says listen_addresses = 'localhost', and add listen_addresses = '*' below it.
  2. Open /etc/postgresql/14/main/pg_hba.conf and paste these lines at the end:
# TYPE  DATABASE        USER    ADDRESS         METHOD
host    all             all     0.0.0.0/0       md5
host    all             all     ::/0            md5
  3. Run sudo systemctl restart postgresql.
  4. Open the port with sudo ufw allow 5432/tcp.

Changing database location

PostgreSQL

Create a new dir, stop the service and copy the PostgreSQL data directory to the new location.

sudo mkdir /postgres
sudo chown -R postgres:postgres /postgres
sudo systemctl stop postgresql
sudo rsync -av /var/lib/postgresql/ /postgres
sudo nano /etc/postgresql/14/main/postgresql.conf

Then search for data_directory and point it to /postgres/14/main.
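For example, given the rsync destination above, the relevant line in postgresql.conf would end up looking like this:

```
data_directory = '/postgres/14/main'
```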

Finally start the service again.

Problems with MySQL/Maria and SQL Server

MySQL/MariaDB

From R, the RMariaDB package copies data around 20 times slower than RMySQL. Using odbc for this SQL backend is not very efficient either. Just use RMySQL.

When you create a backup, at least with MariaDB, mysqldump intendo > intendo-mysql.sql is hundreds of times faster than mysql intendo > intendo-mysql.sql.

SQL Server

Beware that SQL Server shows strange problems on Linux and Mac! For example, when I connect from a Mac, I experience random disconnections and the following error message:

nanodbc/nanodbc.cpp:4483: 00000: [Microsoft][ODBC Driver 18 for SQL Server]Communication link failure 

With Ubuntu I get another error message and RStudio crashes:

terminate called after throwing an instance of 'nanodbc::database_error'

I had to divide my data and write it in chunks of 50,000 rows or it just disconnects. PostgreSQL and MariaDB can write tables containing two million rows right away.
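The chunked writes can be sketched like this, where d is a hypothetical large data frame, con an open DBI connection, and "my_table" a placeholder table name (this is the idea, not the exact code I ran):

```r
library(DBI)

# assign each row to a chunk of at most 50,000 rows
chunk_ids <- ceiling(seq_len(nrow(d)) / 50000)

# write the chunks one at a time, appending to the same table,
# so no single write is large enough to trigger a disconnection
for (chunk in split(d, chunk_ids)) {
  dbWriteTable(con, "my_table", chunk, append = TRUE)
}
```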

Therefore, my preference for DB backends is not arbitrary.

Cite this work

This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

BibTeX entry:

@misc{databases_pacha,
  title = {SQL Databases for Students and Educators},
  url = {https://databases.pacha.dev/},
  author = {Vargas, Mauricio},
  doi = {10.5281/zenodo.4136985},
  publisher = {Self-published},
  year = {2021},
  month = {Feb},
  note = {Accessed: Month DD, YYYY}
}

Access and support

If you need access to the online DBs (i.e., not configuring your local copy), send me an email at hello+spam [at] pacha.dev (replace with +databases for a faster reply) or a tweet to @pachadotdev.

I also have a Buy me a coffee profile. If you like my work, buy me a coffee, it shall be used to produce more of it. Thank you for your support!