SQL Databases for Students and Educators

Pachá

2024-04-08

Update 2022-12-12: I’m using just PostgreSQL. I just had too many heaches with MariaDB/MySQL. Also, the server is quite modest and I prefer to provide good PostgreSQL access than bad access to PostgreSQL and MariaDB.

Update 2022-09-30: The SQL Server educational license expired. From now on, only PostgreSQL and MariaDB databases will be available.

About

Publicly accessible databases often impose query limits or require registration. Even when I maintain public and limit-free APIs, I never wanted to host a public database because I tend to think that the connection strings are a problem for the user.

I’ve decided to host different light/medium size by using PostgreSQL, MariaDB and SQL Server. I tend to prefer PostgreSQL because it easier to configure on client and server side, without strange tricks (see the Problems with MySQL/MariaDB and SQL Server section).

There are a ton of small edge cases when moving between DB back ends and so testing lots with live databases is quite valuable. With this resource you can benchmark speed, compression, and DDL types.

If you need access to the databases or need a SQL dump , send me an email to (replace with +databases for a faster reply).

You can contribute to keep it up with the hosting costs here:

Available Databases

Intendo

It’s a dummy database that originally we used at the My Organization’s First R Package and contains managerial information of a company called Intendo, which creates mobile games. The tables contain information for the 2015 year for all sorts of information for their only game: Super Jetroid. See the recent updated to this database at intendo.

This database can be helpful for workshops and showing how to create an R package focused on the particular needs of an organization, including easier data access, shared functions for data transformation and analysis, and a common look and feel for reporting.

Check the structure of the tables in the next preview. The tables ending with "_f" contain deliberated errors.

library(dplyr)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  user = Sys.getenv("dbedu_usr"),
  password = Sys.getenv("dbedu_pwd"),
  dbname = "intendo",
  host = "databases.pacha.dev"
)

dbListTables(con)
##  [1] "sj_all_sessions_medium"   "sj_users_daily_large"    
##  [3] "sj_users_daily_large_f"   "sj_user_summary_medium"  
##  [5] "sj_all_sessions_large"    "sj_all_revenue_medium_f" 
##  [7] "sj_all_revenue_large_f"   "sj_users_daily_medium"   
##  [9] "sj_user_summary_small"    "sj_all_revenue_small_f"  
## [11] "sj_all_sessions_medium_f" "sj_all_revenue_medium"   
## [13] "sj_all_sessions_small_f"  "sj_user_summary_medium_f"
## [15] "sj_all_sessions_large_f"  "sj_all_revenue_large"    
## [17] "sj_user_summary_small_f"  "sj_users_daily_small"    
## [19] "sj_user_summary_large"    "sj_users_daily_small_f"  
## [21] "sj_users_daily_medium_f"  "sj_all_sessions_small"   
## [23] "sj_user_summary_large_f"  "sj_all_revenue_small"
tbl(con, "sj_user_summary_small") %>%
  glimpse()
## Rows: ??
## Columns: 6
## Database: postgres  [teacher@databases.pacha.dev:5432/intendo]
## $ player_id   <chr> "YMKOHGVFZWLJ836", "SBPFOHCVMNQI568", "DLCKJEZRSIGW561", "…
## $ first_login <dttm> 2015-01-01 01:45:33, 2015-01-01 10:13:16, 2015-01-02 14:1…
## $ start_day   <date> 2015-01-01, 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-0…
## $ country     <chr> "United States", "Japan", "United States", "India", "Switz…
## $ acquisition <chr> "organic", "apple", "organic", "organic", "google", "apple…
## $ device_name <chr> "Samsung Galaxy A7", "Sony Experia Z1 Compact", "Sony Expe…
dbDisconnect(con)

Problems with MySQL/Maria and SQL Server

MySQL/MariaDB

From R, the RMariaDB package copies around 20 times slower than RMySQL. Using odbc for this SQL backend is not very efficient. Just use RMySQL.

When you backup, at least with MariaDB, mysqldump intendo > intendo-mysql.sql is hundreds of times faster than mysql intendo > intendo-mysql.sql.

SQL Server

Beware that SQL Server shows strange problems with Linux and Mac!

When I connect from a Mac, I experience random disconnections and the next error message:

nanodbc/nanodbc.cpp:4483: 00000: [Microsoft][ODBC Driver 18 for SQL Server]Communication link failure 

With Ubuntu I get this another error message and RStudio crashes:

terminate called after throwing an instance of 'nanodbc::database_error'

I had to divide my data and write in chunks of 50,000 rows or it just disconnects. PostgreSQL and MariaDB can write tables containing two million rows right away.

Therefore, my preference for PostgreSQL is not arbitrary.

Cite this work

This work is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

BibTeX entry:

@misc{databases_pacha,
  title = {SQL Databases for Students and Educators},
  url = {https://databases.pacha.dev/},
  author = {Vargas, Mauricio},
  doi = {10.5281/zenodo.4136985},
  publisher = {Self-published},
  year = {2021},
  month = {Feb},
  note = {Accessed: Month DD, YYYY}
}