Summary
- My experience using Carto.com to create web maps for data analysis
- An R package I wrote to wrap Carto.com API calls
- Some notes on managing gigabyte-sized data for mapping
Introduction
Carto.com is a web map provider. I used Carto in my project because:
- With PostgreSQL and PostGIS as the backend, you have all the power of SQL and PostGIS functions, whereas with Mapbox you would need to do everything in JavaScript. Because you can run SQL inside the Carto website UI, it is much easier to experiment and update.
- The new Builder lets users create widgets for a map, which let map viewers select a range in a date or histogram widget, or values of a categorical variable, and the map updates dynamically.
Carto provides several types of API for different tasks. It is simple to construct an API call with curl, but also very cumbersome, and you often need to feed parts of one response into the next request, which means a lot of copy/paste. I try to replace repetitive manual labor with programs as much as possible, so it was only natural to do this with R.
There are some R packages and functions available for the Carto API, but they are either too old and broken or too limited for my usage. I gradually developed my own R functions for every API call I used, then made them into an R package - RCartoAPI. It can:
- upload a local file to Carto
- let Carto import a remote file by URL
- let Carto sync with a remote file
- check sync status
- force sync
- remove a sync connection
- list all sync tables
- run a SQL query
- run a time-consuming SQL query in batch mode and check its status later
So the package is mostly focused on data import/sync and time-consuming SQL queries. It has saved me a lot of time.
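For a flavor of what the wrapper does, here is a minimal sketch of a raw call to Carto's SQL API with httr. The endpoint is Carto's documented /api/v2/sql; the function name and credential handling are just for illustration and are not the package's actual code.

```r
library(httr)

# minimal sketch: run a SQL query against Carto's SQL API
# carto_user / carto_key stand for your own account name and API key
carto_sql <- function(query, carto_user, carto_key) {
  resp <- GET(
    sprintf("https://%s.carto.com/api/v2/sql", carto_user),
    query = list(q = query, api_key = carto_key)
  )
  stop_for_status(resp)
  content(resp, as = "parsed")   # rows come back as parsed JSON
}
```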
Carto user name and API key
All the functions in the package currently require an API key from Carto. Without an API key you can only do some read-only operations on public data. If there is more demand I can add keyless versions, though I think it would be even better for Carto to just provide an API key in the free plan.
It is not easy to save sensitive information securely and conveniently at the same time. After checking this summary and the best practices vignette from httr, I chose to save them as environment variables and minimize the exposure of the user name and API key. After being read from the environment, the user name and API key only exist inside the package functions, which are further wrapped in the package environment and not visible from the global environment.
Most references I found for this usage used .Rprofile, but I think .Renviron is more suitable for this need: if you want to update the variables and reload them, you don't need to touch the other parts of .Rprofile.
When the package is loaded it checks the environment variables for the user name and API key and reports their status. If you modify the user name and API key in .Renviron, just run update_env().
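As an illustration (the exact variable names the package expects are an assumption here), the setup could look like this:

```r
# in ~/.Renviron (hypothetical variable names):
#   carto_acc=your_user_name
#   carto_api_key=your_api_key

# a sketch of reading the credentials without putting them in the global environment
get_credentials <- function() {
  list(
    user = Sys.getenv("carto_acc"),
    key  = Sys.getenv("carto_api_key")
  )
}

# after editing .Renviron in a running session, reload it,
# presumably similar to what update_env() does
readRenviron("~/.Renviron")
```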
Some tips from my experience
csv column type guessing
Carto by default sets CSV column types according to column content. However, sometimes a column of numbers is actually categorical, and often there are leading 0s that need to be kept. If Carto imports these columns as numbers, the leading-0 information is lost and you cannot recover it by changing the column type later in Carto.
Thus I add quotes around the columns that I want to keep as characters, and set the parameter quoted_fields_guessing to FALSE (the default in my package). Then Carto will not guess types for these quoted columns. We still want field guessing on for the other columns, especially because it is easier when Carto recognizes a lon/lat pair and builds the geometry automatically. write.csv writes non-numeric columns with quotes by default, which is what we want. If you are using fwrite from data.table, you need to set quote = TRUE manually.
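For example (the column names and values are made up), preparing a chunk for upload could look like this:

```r
library(data.table)

dt <- data.table(
  zip = c("00501", "02134"),    # leading zeros must survive the round trip
  lon = c(-73.04, -71.12),
  lat = c(40.81, 42.35)
)

# write.csv quotes character columns by default; with fwrite you must ask for it
fwrite(dt, "chunk_1.csv", quote = TRUE)
```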
update data after a map is created
Sometimes I want to update the data used in a map after the map has been created, for example when more data cleaning is needed. I did not find a straightforward way to do this in Carto.
- One way is to upload the new data file under a new name, duplicate the map, and change the SQL call for the data set to load the new table. There are multiple manual steps involved, and it leaves duplicated maps and data sets.
- Another way is to build the map on a sync table pointing to a remote URL, for example a Dropbox shared file. Then you can update the file in Dropbox and let Carto update the data. If the default sync interval is too long, there is a force_sync function in the package to force an immediate sync. Note that there is a 15 minute wait from the last sync before force sync can work.
It is also worth noting that copying a new version of the data file into the local Dropbox folder to overwrite the old version updates the file while keeping the sharing link the same; a sketch of this workflow follows.
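A hypothetical illustration (drop_upload is from rdrop2 and overwrites by default; force_sync is the package function mentioned above, but its argument and the name list_sync_tables are assumptions):

```r
library(rdrop2)

# overwrite the shared file in Dropbox; the share link stays the same
drop_upload("cleaned_data.csv", path = "carto_data")

# find the sync connection and force an immediate sync
# (only works 15+ minutes after the last sync)
# syncs <- list_sync_tables()     # assumed name for "list all sync tables"
# force_sync(syncs$id[1])         # signature assumed
```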
upload large file to Carto
There is a limit of 1 million rows for a single file upload to Carto. I had a data file with 4 million rows, so I had to split it into smaller chunks, upload each file, then combine them with SQL queries. With the help of the rdrop2 package and my own package, I can do all of this automatically, which makes it much easier to update the data and run the process again.
Compared to uploading a huge local file directly to Carto, I think uploading to the cloud is probably more reliable. I chose Dropbox because the direct file link can be inferred from the share link, while I did not find a working method to get the direct link of a Google Drive file.
To run the code below you need to provide a data set, and the verification part may need some column adjustments to pass.
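A minimal sketch of the chunk-and-upload idea, with some assumptions: the input file name, chunk size, and Dropbox folder are made up, and sync_remote() is a hypothetical stand-in for the package's "sync with a remote file" call. drop_auth, drop_upload, and drop_share are rdrop2 functions.

```r
library(data.table)
library(rdrop2)

dt <- fread("big_file.csv")            # ~4 million rows in my case
chunk_size <- 1e6                      # stay under Carto's single-upload limit
dt[, chunk := ceiling(.I / chunk_size)]

drop_auth()                            # authorize rdrop2 once per session

for (i in unique(dt$chunk)) {
  f <- sprintf("chunk_%d.csv", i)
  fwrite(dt[chunk == i, !"chunk"], f, quote = TRUE)
  drop_upload(f, path = "carto_chunks")
  share <- drop_share(file.path("carto_chunks", f))
  # turn the share link into a direct download link for Carto to sync from
  direct_url <- sub("\\?dl=0$", "?dl=1", share$url)
  # sync_remote(direct_url)            # hypothetical "sync with remote file" call
}
```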
In my case I needed to upload four 200 MB files. Any error in the network or on the Carto server can prevent the process from finishing perfectly. Upon checking the sync tables I found the last file's sync was not successful. I tried to force sync it but failed, so I just used the code below to upload and sync that file again.
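A sketch of redoing just the failed chunk (remove_sync and sync_remote are hypothetical names for the package's "remove sync connection" and "sync with a remote file" calls; the file name is assumed):

```r
library(rdrop2)

f <- "chunk_4.csv"                                   # the chunk whose sync failed
drop_upload(f, path = "carto_chunks")                # overwrite the Dropbox copy
share <- drop_share(file.path("carto_chunks", f))
direct_url <- sub("\\?dl=0$", "?dl=1", share$url)

# remove_sync(failed_sync_id)    # drop the broken sync connection first
# sync_remote(direct_url)        # then register the sync again
```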
merge uploaded chunks with batch SQL
With all data files uploaded to Carto, we now need to merge them. Because I tested with a small sample first, I could test my SQL query in the web page directly (click a data set to open the data view, then switch to the SQL view to run a query). After that I ran the query through my R package. Once everything worked, I switched the data set to the full-scale data and ran the whole process again.
I used a template for the SQL queries because I need to apply them to the small sample file first and the larger full-scale file later. With a template I can change the table name easily.
Carto expects a table to match a special schema, including a cartodb_id column. When you upload a file into Carto, Carto converts the data automatically during the import process. Since we are creating a new table through the SQL API directly, this new table did not go through that process and is not ready for Carto mapping yet. We need to drop the cartodb_id column and run the cdb_cartodbfytable function to make the table ready. Only after this finishes can you see the result table in the data set page of Carto.
The SQL queries used here need some time to finish. With rCartoAPI you can run the queries as batch jobs and check the job status easily.
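A sketch of the merge step under some assumptions: the chunk and result table names are made up, sql_batch() is a hypothetical stand-in for the package's batch SQL call, and the first argument of CDB_CartodbfyTable should be your Carto account (schema) name.

```r
chunks <- sprintf("chunk_%d", 1:4)     # hypothetical chunk table names
merged <- "trips_all"                  # hypothetical merged table name

# template the table names into the queries so the same code works for
# the small sample and the full-scale data
merge_sql <- sprintf(
  "CREATE TABLE %s AS %s;",
  merged,
  paste(sprintf("SELECT * FROM %s", chunks), collapse = " UNION ALL ")
)
prepare_sql <- sprintf(
  "ALTER TABLE %s DROP COLUMN cartodb_id;
   SELECT CDB_CartodbfyTable('your_user_name', '%s');",
  merged, merged
)

# sql_batch(merge_sql)      # submit as a batch job ...
# sql_batch(prepare_sql)    # ... then check the job status later
```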
After this I can create a map with the merged data set. However, the map performance was not ideal. I learned that you can create overviews to improve performance in this case. So I drop the overviews for the uploaded chunks, which were created automatically during import but are no longer needed, then create overviews for the merged table.
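A sketch of the overview maintenance, again with the hypothetical table names and the sql_batch() stand-in; CDB_DropOverviews and CDB_CreateOverviews are Carto PostgreSQL functions.

```r
chunks <- sprintf("chunk_%d", 1:4)     # hypothetical chunk table names
merged <- "trips_all"                  # hypothetical merged table name

drop_sql <- paste(
  sprintf("SELECT CDB_DropOverviews('%s'::regclass);", chunks),
  collapse = " "
)
create_sql <- sprintf("SELECT CDB_CreateOverviews('%s'::regclass);", merged)

# sql_batch(drop_sql)       # remove the unneeded overviews on the chunks
# sql_batch(create_sql)     # build overviews on the merged table
```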
Later I wanted to add a year column that works as categorical instead of numerical. Even this simple change is very slow for a table this large, so I had to use a batch SQL query for it. I also needed to update the overviews for the table after this change to the data.
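A sketch of that last step; the source column pickup_datetime is a made-up example and sql_batch() is the same hypothetical stand-in as above.

```r
merged <- "trips_all"                  # hypothetical merged table name

year_sql <- sprintf(
  "ALTER TABLE %s ADD COLUMN year_cat text;
   UPDATE %s SET year_cat = to_char(pickup_datetime, 'YYYY');",
  merged, merged
)
overview_sql <- sprintf(
  "SELECT CDB_DropOverviews('%s'::regclass);
   SELECT CDB_CreateOverviews('%s'::regclass);",
  merged, merged
)

# sql_batch(year_sql)        # add the categorical year column
# sql_batch(overview_sql)    # refresh the overviews afterwards
```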