Hi everyone,
Longtime lurker here, and I really appreciate Dato and this supportive community.
Hoping someone can point me in the right direction for setting up a data / (E)TL workflow from our Google Sheets “database” to DatoCMS.
To give you a brief example: our website is connected to DatoCMS, fetching the data and listing the datasets as content.
A quick overview of our current process and what we want to achieve in the future with DatoCMS:
Current process
- Data collecting and enrichment in Google Sheets
  - Collecting data manually
  - Each row represents an entry/dataset (e.g. a Place or Product on the website)
  - Columns consist of specific, mostly static data
- Transfer from Google Sheets to DatoCMS
  - Manually going through each row in Google Sheets and checking (via a specific column) whether the dataset is ready to be published on the website
  - If ready, going to the DatoCMS backend, creating a new Place or Product, and copy-pasting all the data from Google Sheets into the customized form of the specific collection
  - Uploading pictures to the Place/Product manually as well
Desired process
- Data collecting and enrichment in Google Sheets will be highly automated through API services and different approaches. This part is already working well.
- Transfer from Google Sheets to DatoCMS
  - An identifier column determines whether the corresponding record in DatoCMS gets created, updated, or deleted
Current thoughts and developed approach
Our ETL (Extract-Transform-Load) process needs to be way more automated. The “E” part was already taken care of and goes live soon.
What we’re struggling with is the “TL” part, or “Desired process” step 2.
- What is the best approach to exporting data from Google Sheets while keeping it as our source of data?
- What tools could help?
Creation of new data
I developed some data creation and update scripts in Node.js locally with VS Code, which worked OK.
I also developed a script which connects to our Google Sheet, retrieves all the data from specific worksheets, and stores it as an array/JSON. This way we can work with the data easily and can identify which datasets can be pushed to (created in) DatoCMS and which cannot.
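For reference, this is roughly what that retrieval script looks like (just a sketch: the spreadsheet ID, range and the ready_to_publish column are placeholders from my setup, and it assumes a service account via the official googleapis package):

```js
// fetch-sheet.js - pull all rows of one worksheet into an array of objects
const { google } = require('googleapis');

async function fetchRows(spreadsheetId, range) {
  // Credentials come from a service account (GOOGLE_APPLICATION_CREDENTIALS)
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/spreadsheets.readonly'],
  });
  const sheets = google.sheets({ version: 'v4', auth });

  const res = await sheets.spreadsheets.values.get({ spreadsheetId, range });
  const [header = [], ...rows] = res.data.values || [];

  // Turn each row into an object keyed by the header row
  return rows.map((row) =>
    Object.fromEntries(header.map((col, i) => [col, row[i] ?? ''])),
  );
}

// Example: keep only the datasets flagged as ready for DatoCMS
fetchRows('SPREADSHEET_ID', 'Places!A1:Z')
  .then((rows) => rows.filter((r) => r.ready_to_publish === 'yes'))
  .then((ready) => console.log(`${ready.length} datasets ready to push`));
```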
I intend to use a primary key in our Google Sheet which we also push to DatoCMS; this way we can tell whether a record has already been pushed or not. This field would be set to “no duplicates allowed” (unique) in the DatoCMS model.
I think after the creation process is done, the best thing to do is to retrieve the internal ID of this record from DatoCMS and write it back into Google Sheets. That way I'd be able to identify the record easily for later updates and changes, correct?
How do I retrieve the record ID after creation? Should I use our own primary key from Google Sheets, search all records for it, and read back the DatoCMS record ID that way, or is there a better approach?
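To make the question concrete, here is a minimal sketch of the creation step as I currently picture it (using the @datocms/cma-client-node package; the place model and the name/sheet_key field API keys are placeholders from our schema):

```js
// create-place.js - create one DatoCMS record and capture its internal ID
const { buildClient } = require('@datocms/cma-client-node');

const client = buildClient({ apiToken: process.env.DATOCMS_API_TOKEN });

async function createPlace(row) {
  // Find the model ("item type") whose API key is "place"
  const itemTypes = await client.itemTypes.list();
  const placeType = itemTypes.find((t) => t.api_key === 'place');

  const record = await client.items.create({
    item_type: { type: 'item_type', id: placeType.id },
    name: row.name,
    sheet_key: row.sheet_key, // our own primary key, marked unique in DatoCMS
  });

  // The create call already returns the new record, including its internal ID,
  // so it can be written straight back into the Google Sheet for later updates
  return record.id;
}

// Fallback: if the ID was never written back, find the record via the unique key
async function findPlaceBySheetKey(sheetKey) {
  const matches = await client.items.list({
    filter: { type: 'place', fields: { sheet_key: { eq: sheetKey } } },
  });
  return matches[0]?.id;
}
```

If that's right, I'd mostly rely on the ID returned by the create call and keep the lookup via sheet_key only as a fallback, but let me know if there's a better pattern.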
What I am also struggling with is my linked data management. Here is a snippet of our entity relationship diagram (ERD), extracted from DatoCMS.
Let’s assume I want to create data (a new [Place] - see box in the upper left corner of the ERD).
I cannot just use the data as it is in our Google Sheet, correct?
Many of the attributes in this small extract of the Google Sheet are linked fields in our DatoCMS model schema. Of course this is necessary, because the model should be free of redundancy (like any relational database). But how do I solve this when uploading data to DatoCMS?
What is the best approach here? I came up with two ideas, but I'm not very happy with either of them, and I don't have the expertise to do better.
Idea 1 (rough sketch after both ideas): when starting the whole upload process, the first step would be to retrieve all existing records of the linked models and store them as arrays/temporary lookup tables.
While looping through each dataset that will be pushed to DatoCMS, I replace every value belonging to a linked field with the primary key related to it.
This means the value MUST be spelled correctly, otherwise I'd face issues while uploading.
Idea 2: fetch the tables for the linked values and paste them into Google Sheets from time to time, and use them as a kind of data validation list during data entry. I haven't figured that out exactly yet, but this approach is different from doing it on the fly while uploading, and maybe it's less error-prone in the end.
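For the first idea, this is the kind of thing I have in mind (sketch only; category and region are placeholder linked models, and name is the column/field I'd match on):

```js
// resolve-links.js - build lookup maps for linked models and swap
// spreadsheet values for DatoCMS record IDs before creating records
const { buildClient } = require('@datocms/cma-client-node');

const client = buildClient({ apiToken: process.env.DATOCMS_API_TOKEN });

// Fetch every record of a linked model once and key it by its display value
async function buildLookup(modelApiKey, labelField) {
  const lookup = new Map();
  for await (const record of client.items.listPagedIterator({
    filter: { type: modelApiKey },
  })) {
    lookup.set(String(record[labelField]).trim().toLowerCase(), record.id);
  }
  return lookup;
}

async function resolveRows(rows) {
  // Build the lookups once per run, not once per row
  const categories = await buildLookup('category', 'name'); // placeholder model
  const regions = await buildLookup('region', 'name');      // placeholder model

  return rows.map((row) => {
    const categoryId = categories.get(row.category.trim().toLowerCase());
    const regionId = regions.get(row.region.trim().toLowerCase());

    // Fail loudly on typos/unknown values instead of creating broken records
    if (!categoryId || !regionId) {
      throw new Error(`Unresolved linked value in row "${row.sheet_key}"`);
    }
    return { ...row, category: categoryId, region: regionId };
  });
}
```

The trim/lowercase normalization takes some of the spelling pressure off, but exact matching is still the weak point of this idea.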
The next question regarding all the creation/update/deletion processes is how to run them and where to store those scripts.
I came up with an idea, but here too I'm not familiar with good practices.
My idea is to create a Node.js script for each of the processes and host them as cron jobs on an external server. The master data admin would trigger them manually as desired. When triggered, a connection to the Google Sheet is established and the show goes on.
As far as I can tell, there is no way to pull all the necessary Node.js libraries into Google Apps Script, correct? If there were, I could create a custom ribbon/menu in Google Sheets and trigger the functions from there directly.
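One workaround I've been considering instead: keep the Node.js scripts on the external host behind a small HTTP endpoint and only use Apps Script to add a menu that triggers them (sketch below; the URL and secret are placeholders):

```js
// Google Apps Script (not Node.js): custom menu that triggers the externally
// hosted sync scripts via a webhook
function onOpen() {
  SpreadsheetApp.getUi()
    .createMenu('DatoCMS Sync')
    .addItem('Push ready datasets', 'triggerSync')
    .addToUi();
}

function triggerSync() {
  // Placeholder URL and secret - the real endpoint would live on our host
  const response = UrlFetchApp.fetch('https://example.com/sync/places', {
    method: 'post',
    headers: { Authorization: 'Bearer SYNC_SECRET' },
    muteHttpExceptions: true,
  });
  SpreadsheetApp.getUi().alert('Sync triggered, status ' + response.getResponseCode());
}
```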
What is the best approach to do so? How do people/companies handle this kind of task?
Appreciate any input. Bit lost!