SQL Databases for Students and Educators

Pachá

2022-12-12

Update 2022-12-12: I’m now using PostgreSQL only. I just had too many headaches with MariaDB/MySQL. Also, the server is quite modest, and I prefer to provide good PostgreSQL access rather than poor access to both PostgreSQL and MariaDB.

Update 2022-09-30: The SQL Server educational license expired. From now on, only PostgreSQL and MariaDB databases will be available.

About

Publicly accessible databases often impose query limits or require registration. Even though I maintain public and limit-free APIs, I never wanted to host a public database, because I tend to think that connection strings are a problem for the user.

I’ve decided to host different light/medium-size databases using PostgreSQL. I tend to prefer PostgreSQL because it is easier to configure on both the client and server side, without strange tricks (see the Problems with MySQL/MariaDB and SQL Server section).

If I can secure funding for this, I will make the same databases available under different backends (even MariaDB and MySQL at the same time). There are a ton of small edge cases when moving between database backends, so testing against live databases is quite valuable. With this resource you can benchmark speed, compression, and DDL types.

If you need access to the online DBs (i.e., not configuring your local copy) or you need technical support, send me an email to (replace with +databases for a faster reply).

You can contribute to help cover the hosting costs here:

Available Databases

Open Trade Statistics

This is the database that powers Open Trade Statistics and contains international trade data for commodities since 2002.

This database can be helpful for evaluating the impact of free trade agreements and other policy-relevant modelling. The database is a modified version of the United Nations Commodity Trade Statistics Database (UN COMTRADE) and is targeted at Latin American universities, which often have limited or no access to the original resource.

Access these tables by adapting the following code:

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "tradestatistics",
  host = "databases.pacha.dev"
)

dbListTables(con)
##  [1] "distances"         "sections_colors"   "yr"               
##  [4] "yrc"               "yrp"               "yrpc"             
##  [7] "rtas"              "tariffs"           "gdp"              
## [10] "gdp_deflator"      "vaccine_inputs"    "commodities"      
## [13] "commodities_short" "yc"                "countries"        
## [16] "countries_colors"  "sections"
tbl(con, "yrpc") %>% glimpse()
## Rows: ??
## Columns: 7
## Database: postgres  [teacher@databases.pacha.dev:5432/tradestatistics]
## $ year                <int> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 20…
## $ reporter_iso        <chr> "abw", "abw", "abw", "abw", "abw", "abw", "abw", "…
## $ partner_iso         <chr> "ant", "arg", "atg", "aus", "aut", "bra", "brb", "…
## $ section_code        <chr> "99", "99", "99", "99", "99", "99", "99", "99", "9…
## $ commodity_code      <chr> "999999", "999999", "999999", "999999", "999999", …
## $ trade_value_usd_imp <dbl> 242559, 0, 346, 659, 2056, 71380, 1112, 95223, 102…
## $ trade_value_usd_exp <dbl> 38553, 2313, 1006, 419, 0, 0, 0, 0, 0, 0, 0, 40553…
dbDisconnect(con)
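Because dplyr translates its verbs to SQL, summaries run on the server and only the result is downloaded. As an illustrative sketch (the column names come from the yrpc preview above; "chl" is assumed here to be the ISO code for Chile):

```r
library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "tradestatistics",
  host = "databases.pacha.dev"
)

# Total exports reported by Chile per year, aggregated server-side;
# collect() downloads only the summarised result
tbl(con, "yrpc") %>%
  filter(reporter_iso == "chl") %>%
  group_by(year) %>%
  summarise(total_exports_usd = sum(trade_value_usd_exp, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```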

Base Hansard

A complete PostgreSQL database of the Canadian Hansard dataset, including supplementary data. It covers the Canadian Hansard from 1901 to 2019 and was obtained from LiPaD with the intention of providing a live SQL database for users interested in this data. The original site provides a SQL dump that requires you to configure PostgreSQL on your own laptop or server.

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "basehansard",
  host = "databases.pacha.dev"
)

dbListTables(con)
##  [1] "andablog_entryimage"         "contactme_contact_msg"      
##  [3] "dilipadsite_datenav"         "dilipadsite_constituency"   
##  [5] "dilipadsite_basehansard"     "auth_group"                 
##  [7] "auth_group_permissions"      "auth_user_user_permissions" 
##  [9] "auth_user_groups"            "better500s_caughterror"     
## [11] "dilipadsite_basehansard_bak" "dilipadsite_blogger"        
## [13] "auth_user"                   "auth_permission"            
## [15] "dilipadsite_datenav_bak"     "dilipadsite_datepickle"     
## [17] "dilipadsite_parlsess"        "dilipadsite_member"         
## [19] "django_admin_log"            "django_migrations"          
## [21] "django_content_type"         "dilipadsite_position"       
## [23] "django_session"              "downtime_period"            
## [25] "taggit_templatetags_amodel"  "taggit_taggeditem"          
## [27] "taggit_tag"                  "andablog_entry"             
## [29] "django_site"                 "dilipadsite_party"
tbl(con, "dilipadsite_basehansard") %>% glimpse()
## Rows: ??
## Columns: 15
## Database: postgres  [teacher@databases.pacha.dev:5432/basehansard]
## $ basepk          <int> 2691572, 2691576, 514587, 14, 21, 4043, 514588, 3, 4, …
## $ hid             <chr> "ca.proc.d.1970-11-13.28.1.89", "ca.proc.d.1970-11-13.…
## $ speechdate      <date> 1970-11-13, 1970-11-13, 1918-05-20, 1901-02-06, 1901-…
## $ pid             <chr> "7f5ea10f-f4c5-4350-b707-6c6522149e5f", "intervention"…
## $ opid            <chr> "ca.m.3511", "", "ca.m.4097", "", "", "", "ca.m.4942",…
## $ speakeroldname  <chr> "Mr. Cafik:", "", "Sir GEORGE FOSTER:", "", "", "", "M…
## $ speakerposition <chr> "", "", "", "topic", "stagedirection", "", "", "topic"…
## $ maintopic       <chr> "GOVERNMENT ORDERS", "GOVERNMENT ORDERS", "SUPPLY.-", …
## $ subtopic        <chr> "PUBLIC ORDER (TEMPORARY MEASURES) ACT, 1970 PROVISION…
## $ speechtext      <chr> "Let me point out to the hon. member for Matane that I…
## $ speakerparty    <chr> "Liberal", "", "Conservative (1867-1942)", "", "", "",…
## $ speakerriding   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ speakername     <chr> "Norman Augustine Cafik", "An hon. Member:", "George G…
## $ speakerurl      <chr> "http://www.parl.gc.ca/parlinfo/Files/Parliamentarian.…
## $ subsubtopic     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
dbDisconnect(con)

Intendo

This is a dummy database that we originally used for My Organization’s First R Package. It contains managerial information for a company called Intendo, which creates mobile games. The tables cover the year 2015 and hold all sorts of information about their only game: Super Jetroid. See the recent updates to this database at intendo.

This database can be helpful for workshops and showing how to create an R package focused on the particular needs of an organization, including easier data access, shared functions for data transformation and analysis, and a common look and feel for reporting.

Check the structure of the tables in the next preview. The tables ending in “_f” contain deliberate errors.

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "intendo",
  host = "databases.pacha.dev"
)

dbListTables(con)
##  [1] "sj_all_revenue_small_f"   "sj_users_daily_small"    
##  [3] "sj_user_summary_medium"   "sj_all_sessions_large_f" 
##  [5] "sj_users_daily_large_f"   "sj_all_sessions_xlarge_f"
##  [7] "sj_user_summary_xlarge_f" "sj_users_daily_xlarge_f" 
##  [9] "sj_all_revenue_small"     "sj_all_revenue_medium_f" 
## [11] "sj_users_daily_medium_f"  "sj_all_sessions_large"   
## [13] "sj_users_daily_large"     "sj_all_sessions_small_f" 
## [15] "sj_all_revenue_medium"    "sj_users_daily_medium"   
## [17] "sj_user_summary_large_f"  "sj_all_revenue_xlarge_f" 
## [19] "sj_all_sessions_small"    "sj_all_sessions_medium_f"
## [21] "sj_all_revenue_large_f"   "sj_user_summary_large"   
## [23] "sj_all_revenue_xlarge"    "sj_all_sessions_xlarge"  
## [25] "sj_user_summary_xlarge"   "sj_users_daily_xlarge"   
## [27] "sj_user_summary_small_f"  "sj_all_sessions_medium"  
## [29] "sj_all_revenue_large"     "sj_user_summary_small"   
## [31] "sj_users_daily_small_f"   "sj_user_summary_medium_f"
tbl(con, "sj_user_summary_small") %>% glimpse()
## Rows: ??
## Columns: 6
## Database: postgres  [teacher@databases.pacha.dev:5432/intendo]
## $ player_id   <chr> "YMKOHGVFZWLJ836", "SBPFOHCVMNQI568", "DLCKJEZRSIGW561", "…
## $ first_login <dttm> 2015-01-01 01:45:33, 2015-01-01 10:13:16, 2015-01-02 14:1…
## $ start_day   <date> 2015-01-01, 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-0…
## $ country     <chr> "United States", "Japan", "United States", "India", "Switz…
## $ acquisition <chr> "organic", "apple", "organic", "organic", "google", "apple…
## $ device_name <chr> "Samsung Galaxy A7", "Sony Experia Z1 Compact", "Sony Expe…
dbDisconnect(con)
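As a sketch of how the “_f” tables can drive a data-validation exercise, one can compare a clean table with its error-seeded counterpart (the column names come from the preview above; the missing-country check is just an illustrative choice):

```r
library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "intendo",
  host = "databases.pacha.dev"
)

# Rows with a missing country in the clean table...
tbl(con, "sj_user_summary_small") %>%
  filter(is.na(country)) %>%
  count()

# ...versus the "_f" version, which contains deliberate errors
tbl(con, "sj_user_summary_small_f") %>%
  filter(is.na(country)) %>%
  count()

dbDisconnect(con)
```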

Financial (aka Loan Application)

This database contains 606 successful and 76 unsuccessful loans, along with their information and transactions.

This database comes from the Relational Dataset Repository, where you can find the database schema and more information. Its full name is “PKDD’99 Financial Dataset or Loan Application”, and this particular version corresponds to Financial_ijs by Janez Kranjc.

Access these tables by adapting the following code:

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "financial",
  host = "databases.pacha.dev"
)

dbListTables(con)
## [1] "accounts"  "cards"     "clients"   "disps"     "districts" "loans"    
## [7] "orders"    "tkeys"     "trans"
tbl(con, "clients") %>% glimpse()
## Rows: ??
## Columns: 4
## Database: postgres  [teacher@databases.pacha.dev:5432/financial]
## $ id           <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ birth_number <chr> "706213    ", "450204    ", "406009    ", "561201    ", "…
## $ district_id  <int64> 18, 1, 1, 5, 5, 12, 15, 51, 60, 57, 57, 40, 54, 76, 21,…
## $ tkey_id      <int64> NA, 21, 21, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
dbDisconnect(con)

SQL dumps

All the dumps are available for PostgreSQL to promote its use. I just had too many problems with the MariaDB/MySQL configuration, and even more with SQL Server (see the Problems with MySQL/MariaDB and SQL Server section).

Open Trade Statistics

Intendo

Financial

Using the dumps locally

Restore dumps

PostgreSQL

sudo -i -u postgres
createdb intendo
pg_restore -d intendo intendo.sql

Create users

You can create a generic user (let’s say student) and grant read-only access.

PostgreSQL

# sudo -i -u postgres 
# psql -d intendo
CREATE ROLE student;
CREATE ROLE teacher;
ALTER ROLE student WITH ENCRYPTED PASSWORD 'SomePassword';
ALTER ROLE teacher WITH ENCRYPTED PASSWORD 'SomePassword';
ALTER ROLE student WITH LOGIN;
ALTER ROLE teacher WITH LOGIN;

GRANT CONNECT ON DATABASE intendo TO student;
GRANT USAGE ON SCHEMA public TO student;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO student;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO student;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO student;
REVOKE CREATE ON SCHEMA public FROM public;

GRANT CONNECT ON DATABASE intendo TO teacher;
GRANT USAGE ON SCHEMA public TO teacher;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO teacher;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO teacher;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO teacher;
GRANT CREATE ON SCHEMA public TO teacher;
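A quick sanity check for the grants above (a sketch, assuming a local intendo database and the passwords set earlier): the student role should be able to read but not create tables.

```shell
# Reads should succeed for the read-only student role
PGPASSWORD='SomePassword' psql -h localhost -U student -d intendo \
  -c "SELECT count(*) FROM sj_user_summary_small;"

# Writes should fail with "permission denied for schema public"
PGPASSWORD='SomePassword' psql -h localhost -U student -d intendo \
  -c "CREATE TABLE scratch (id int);"
```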

Allow external connections

For this project I had to allow connections from outside the server.

Postgres (version 14)

  1. Open /etc/postgresql/14/main/postgresql.conf, comment out the line that says listen_addresses = 'localhost', and add listen_addresses = '*' below it.
  2. Open /etc/postgresql/14/main/pg_hba.conf and paste these lines at the end:
# TYPE  DATABASE        USER    ADDRESS         METHOD
host    all             all     0.0.0.0/0       md5
host    all             all     ::/0            md5
  3. Run sudo systemctl restart postgresql.
  4. Open the port with sudo ufw allow 5432/tcp.
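After restarting and opening the port, connectivity can be checked from another machine with pg_isready, which ships with the PostgreSQL client tools (a sketch; replace the host with your server's address):

```shell
# Reports "accepting connections" once the server is reachable
pg_isready -h databases.pacha.dev -p 5432
```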

Changing database location

PostgreSQL

Create a new directory, stop the service, and copy the PostgreSQL data directory to the new location.

sudo mkdir /postgres
sudo chown -R postgres:postgres /postgres
sudo systemctl stop postgresql
sudo rsync -av /var/lib/postgresql/ /postgres
sudo nano /etc/postgresql/14/main/postgresql.conf

Then search for data_directory and change the location to /postgres/14/main.

Finally start the service again.
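Once the service is back up, the server itself can confirm where its data directory now lives (a sketch):

```shell
# Prints the data_directory currently set in postgresql.conf
sudo -u postgres psql -c 'SHOW data_directory;'
```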

Problems with MySQL/MariaDB and SQL Server

MySQL/MariaDB

From R, the RMariaDB package writes data around 20 times slower than RMySQL. Using odbc for this SQL backend is not very efficient either. Just use RMySQL.

When you back up, at least with MariaDB, dumping with mysqldump intendo > intendo-mysql.sql is hundreds of times faster than restoring with mysql intendo < intendo-mysql.sql.

SQL Server

Beware that SQL Server shows strange problems on Linux and Mac! For example, when I connect from a Mac, I experience random disconnections and the following error message:

nanodbc/nanodbc.cpp:4483: 00000: [Microsoft][ODBC Driver 18 for SQL Server]Communication link failure 

On Ubuntu I get another error message and RStudio crashes:

terminate called after throwing an instance of 'nanodbc::database_error'

I had to split my data and write it in chunks of 50,000 rows or the connection would just drop. PostgreSQL and MariaDB can write tables containing two million rows right away.

Therefore, my preference for DB backends is not arbitrary.

Cite this work

This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

BibTeX entry:

@misc{databases_pacha,
  title = {SQL Databases for Students and Educators},
  url = {https://databases.pacha.dev/},
  author = {Vargas, Mauricio},
  doi = {10.5281/zenodo.4136985},
  publisher = {Self-published},
  year = {2021},
  month = {Feb},
  note = {Accessed: Month DD, YYYY}
}