Summary
- My experience using Carto.com to create web maps for data analysis
- An R package I wrote to wrap Carto.com API calls
- Some notes on managing gigabyte-sized data for mapping
Introduction
Carto.com is a web map provider. I used Carto in my project because:
- With PostgreSQL and PostGIS as the backend, you have all the power of SQL and PostGIS functions, whereas with Mapbox you would need to do everything in JavaScript. Because you can run SQL inside the Carto website UI, it is much easier to experiment and update.
- The new Builder lets users create widgets for a map, which let map viewers select a range in a date or histogram widget, or values of a categorical variable, and the map updates dynamically.
Carto provides several types of API for different tasks. It is simple to construct an API call with curl, but also very cumbersome, and you often need to feed parts of one response into the next request, which means a lot of copy/paste. I try to replace repetitive manual labor with programs as much as possible, so it was only natural to do this with R.
There are some R packages and functions available for the Carto API, but they are either too old and broken or too limited for my usage. I gradually developed my own R functions for every API call I used, then made them into an R package - RCartoAPI. It can:
- upload a local file to Carto
- let Carto import a remote file by URL
- let Carto sync with a remote file
- check sync status
- force sync
- remove a sync connection
- list all sync tables
- run a SQL query
- run a time-consuming SQL query in batch mode and check its status later
So the package is mostly focused on data import/sync and time-consuming SQL queries. It has saved me a lot of time.
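For a flavor of what the wrapper does, here is a minimal sketch of a raw call to Carto's SQL API with httr. The endpoint is Carto's documented /api/v2/sql; the function name and credential handling are just for illustration and are not the package's actual code.

```r
library(httr)

# minimal sketch: run a SQL query against Carto's SQL API
# carto_user / carto_key stand for your own account name and API key
carto_sql <- function(query, carto_user, carto_key) {
  resp <- GET(
    sprintf("https://%s.carto.com/api/v2/sql", carto_user),
    query = list(q = query, api_key = carto_key)
  )
  stop_for_status(resp)
  content(resp, as = "parsed")   # rows come back as parsed JSON
}
```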
Carto user name and API key
All the functions in the package currently require an API key from Carto. Without an API key you can only do some read-only operations on public data. If there is more demand I can add keyless versions, though I think it would be even better for Carto to just provide an API key in the free plan.
It is not easy to save sensitive information securely and conveniently at the same time. After checking this summary and the best practices vignette from httr, I chose to save them as environment variables and minimize the exposure of the user name and API key. After being read from the environment, the user name and API key only exist inside the package functions, which are further wrapped in the package environment and not visible from the global environment.
Most references I found for this usage used .Rprofile, but I think .Renviron is more suitable for this need: if you want to update the variables and reload them, you don't need to touch the other parts of .Rprofile.
When the package is loaded it checks the environment variables for the user name and API key and reports their status. If you modify the user name and API key in .Renviron, just run update_env().
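As an illustration (the exact variable names the package expects are an assumption here), the setup could look like this:

```r
# in ~/.Renviron (hypothetical variable names):
#   carto_acc=your_user_name
#   carto_api_key=your_api_key

# a sketch of reading the credentials without putting them in the global environment
get_credentials <- function() {
  list(
    user = Sys.getenv("carto_acc"),
    key  = Sys.getenv("carto_api_key")
  )
}

# after editing .Renviron in a running session, reload it,
# presumably similar to what update_env() does
readRenviron("~/.Renviron")
```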
Some tips from my experience
csv column type guessing
Carto by default sets CSV column types according to column content. However, sometimes a column of numbers is actually categorical, and often there are leading 0s that need to be kept. If Carto imports these columns as numbers, the leading-0 information is lost and you cannot recover it by changing the column type later in Carto.
Thus I add quotes around the columns that I want to keep as characters, and set the parameter quoted_fields_guessing to FALSE (the default in my package). Then Carto will not guess types for these quoted columns. We still want field guessing on for the other columns, especially because it is easier when Carto recognizes a lon/lat pair and builds the geometry automatically. write.csv writes non-numeric columns with quotes by default, which is what we want. If you are using fwrite from data.table, you need to set quote = TRUE manually.
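For example (the column names and values are made up), preparing a chunk for upload could look like this:

```r
library(data.table)

dt <- data.table(
  zip = c("00501", "02134"),    # leading zeros must survive the round trip
  lon = c(-73.04, -71.12),
  lat = c(40.81, 42.35)
)

# write.csv quotes character columns by default; with fwrite you must ask for it
fwrite(dt, "chunk_1.csv", quote = TRUE)
```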
update data after a map is created
Sometimes I want to update the data used in a map after the map has been created, for example when more data cleaning is needed. I did not find a straightforward way to do this in Carto.
- One way is to upload the new data file under a new name, duplicate the map, and change the SQL call for the data set to load the new table. There are multiple manual steps involved, and it leaves duplicated maps and data sets.
- Another way is to build the map on a sync table pointing to a remote URL, for example a Dropbox shared file. Then you can update the file in Dropbox and let Carto update the data. If the default sync interval is too long, there is a force_sync function in the package to force an immediate sync. Note that there is a 15 minute wait from the last sync before force sync can work.
It is also worth noting that copying a new version of the data file into the local Dropbox folder to overwrite the old version updates the file while keeping the sharing link the same; a sketch of this workflow follows.
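A hypothetical illustration (drop_upload is from rdrop2 and overwrites by default; force_sync is the package function mentioned above, but its argument and the name list_sync_tables are assumptions):

```r
library(rdrop2)

# overwrite the shared file in Dropbox; the share link stays the same
drop_upload("cleaned_data.csv", path = "carto_data")

# find the sync connection and force an immediate sync
# (only works 15+ minutes after the last sync)
# syncs <- list_sync_tables()     # assumed name for "list all sync tables"
# force_sync(syncs$id[1])         # signature assumed
```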
upload large file to Carto
There is a limit of 1 million rows for a single file upload to Carto. I had a data file with 4 million rows, so I had to split it into smaller chunks, upload each file, then combine them with SQL queries. With the help of the rdrop2 package and my own package, I can do all of this automatically, which makes it much easier to update the data and run the process again.
Compared to uploading a huge local file directly to Carto, I think uploading to the cloud is probably more reliable. I chose Dropbox because the direct file link can be inferred from the share link, while I did not find a working method to get the direct link of a Google Drive file.
To run the code below you need to provide a data set, and the verification part may need some column adjustments to pass.
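A minimal sketch of the chunk-and-upload idea, with some assumptions: the input file name, chunk size, and Dropbox folder are made up, and sync_remote() is a hypothetical stand-in for the package's "sync with a remote file" call. drop_auth, drop_upload, and drop_share are rdrop2 functions.

```r
library(data.table)
library(rdrop2)

dt <- fread("big_file.csv")            # ~4 million rows in my case
chunk_size <- 1e6                      # stay under Carto's single-upload limit
dt[, chunk := ceiling(.I / chunk_size)]

drop_auth()                            # authorize rdrop2 once per session

for (i in unique(dt$chunk)) {
  f <- sprintf("chunk_%d.csv", i)
  fwrite(dt[chunk == i, !"chunk"], f, quote = TRUE)
  drop_upload(f, path = "carto_chunks")
  share <- drop_share(file.path("carto_chunks", f))
  # turn the share link into a direct download link for Carto to sync from
  direct_url <- sub("\\?dl=0$", "?dl=1", share$url)
  # sync_remote(direct_url)            # hypothetical "sync with remote file" call
}
```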
In my case I needed to upload four 200 MB files. Any error in the network or on the Carto server can prevent the process from finishing perfectly. Upon checking the sync tables I found the last file's sync was not successful. I tried to force sync it but failed, so I just used the code below to upload and sync that file again.
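A sketch of redoing just the failed chunk (remove_sync and sync_remote are hypothetical names for the package's "remove sync connection" and "sync with a remote file" calls; the file name is assumed):

```r
library(rdrop2)

f <- "chunk_4.csv"                                   # the chunk whose sync failed
drop_upload(f, path = "carto_chunks")                # overwrite the Dropbox copy
share <- drop_share(file.path("carto_chunks", f))
direct_url <- sub("\\?dl=0$", "?dl=1", share$url)

# remove_sync(failed_sync_id)    # drop the broken sync connection first
# sync_remote(direct_url)        # then register the sync again
```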
merge uploaded chunks with batch SQL
With all data files uploaded to Carto, we now need to merge them. Because I tested with a small sample first, I could test my SQL query in the web page directly (click a data set to open the data view, then switch to the SQL view to run a query). After that I ran the query through my R package. Once everything worked, I switched the data set to the full-scale data and ran the whole process again.
I used a template for the SQL queries because I need to apply them to the small sample file first and the larger full-scale file later. With a template I can change the table name easily.
Carto expects a table to match a special schema, including a cartodb_id column. When you upload a file into Carto, Carto converts the data automatically during the import process. Since we are creating a new table through the SQL API directly, this new table did not go through that process and is not ready for Carto mapping yet. We need to drop the cartodb_id column and run the cdb_cartodbfytable function to make the table ready. Only after this finishes can you see the result table in the data set page of Carto.
The SQL queries used here need some time to finish. With rCartoAPI you can run the queries as batch jobs and check the job status easily.
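A sketch of the merge step under some assumptions: the chunk and result table names are made up, sql_batch() is a hypothetical stand-in for the package's batch SQL call, and the first argument of CDB_CartodbfyTable should be your Carto account (schema) name.

```r
chunks <- sprintf("chunk_%d", 1:4)     # hypothetical chunk table names
merged <- "trips_all"                  # hypothetical merged table name

# template the table names into the queries so the same code works for
# the small sample and the full-scale data
merge_sql <- sprintf(
  "CREATE TABLE %s AS %s;",
  merged,
  paste(sprintf("SELECT * FROM %s", chunks), collapse = " UNION ALL ")
)
prepare_sql <- sprintf(
  "ALTER TABLE %s DROP COLUMN cartodb_id;
   SELECT CDB_CartodbfyTable('your_user_name', '%s');",
  merged, merged
)

# sql_batch(merge_sql)      # submit as a batch job ...
# sql_batch(prepare_sql)    # ... then check the job status later
```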
After this I can create a map with the merged data set. However, the map performance was not ideal. I learned that you can create overviews to improve performance in this case. So I drop the overviews for the uploaded chunks, which were created automatically during import but are no longer needed, then create overviews for the merged table.
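A sketch of the overview maintenance, again with the hypothetical table names and the sql_batch() stand-in; CDB_DropOverviews and CDB_CreateOverviews are Carto PostgreSQL functions.

```r
chunks <- sprintf("chunk_%d", 1:4)     # hypothetical chunk table names
merged <- "trips_all"                  # hypothetical merged table name

drop_sql <- paste(
  sprintf("SELECT CDB_DropOverviews('%s'::regclass);", chunks),
  collapse = " "
)
create_sql <- sprintf("SELECT CDB_CreateOverviews('%s'::regclass);", merged)

# sql_batch(drop_sql)       # remove the unneeded overviews on the chunks
# sql_batch(create_sql)     # build overviews on the merged table
```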
Later I wanted to add a year column that works as categorical instead of numerical. Even this simple change is very slow for a table this large, so I had to use a batch SQL query for it. I also needed to update the overviews for the table after this change to the data.
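A sketch of that last step; the source column pickup_datetime is a made-up example and sql_batch() is the same hypothetical stand-in as above.

```r
merged <- "trips_all"                  # hypothetical merged table name

year_sql <- sprintf(
  "ALTER TABLE %s ADD COLUMN year_cat text;
   UPDATE %s SET year_cat = to_char(pickup_datetime, 'YYYY');",
  merged, merged
)
overview_sql <- sprintf(
  "SELECT CDB_DropOverviews('%s'::regclass);
   SELECT CDB_CreateOverviews('%s'::regclass);",
  merged, merged
)

# sql_batch(year_sql)        # add the categorical year column
# sql_batch(overview_sql)    # refresh the overviews afterwards
```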