Over the past month or so I’ve been managing a fairly complex migration in our Dato setup and running into some particular pain points. It seems like there could be fairly simple solutions that would make these easier to deal with. Roughly, my requirements are these:
We want to change our existing data structure within Dato to an improved one supporting our newer business requirements.
7-8 models with numerous linked fields between them (e.g. categories, regions, etc.). These need to be tweaked, with new supporting models added and existing fields changed.
Around 30,000 existing records that need specific conditional changes, updates and references to different models and records. E.g. splitting a particular text field into two, adding conditional switches with newer fields, changing a ‘key’ string field to a linked field pointing to another model, etc.
Records are being created, edited and moved around all the time. The migration and switchover need to happen seamlessly, without losing the exact state of the environment.
Updated date matters for ordering main records.
Record IDs are attached to data stored outside of DatoCMS in our own tables. E.g. views are stored elsewhere to measure performance in a BI report.
The ideal solution here would be access to the underlying database or tables for an environment; that would make all of this fairly trivial to achieve. But without that, I have a couple of approaches, each with advantages and disadvantages. The pain points are these:
The simplest way to achieve this is to fork the environment and make modifications to it, but that means our ability to keep using the environment seamlessly is impacted. The modifications have to be applied record by record, each with an update and a publish, and with this many records it takes more than a full day to run the script.
When we run the updates to a model’s records, all the last-updated times instantly become ‘today’ on publish. We can keep the order by running through them precisely in ascending order, but when each record was really last updated is lost. Not a disastrous data loss, but quite annoying. Again, there isn’t any way to directly set this data on a record. It also means the script has to run record by record rather than in batches, making it significantly slower.
The simplest, easiest thing for me would be a database backup/restore feature, in whatever format was secure enough to be viable. Just being able to mass-edit the data in some way would solve everything instantly. The next-best thing would be some kind of support for creating or editing records with all of their metadata intact.
I realize this is pretty niche, but those things would make my life easier.
Edit: For some reason I thought I couldn’t create records with a particular ID, so I’ve removed references to this. That takes away one of my pain points, but I would still appreciate more direct control over editing.
Thank you for the detailed thoughts! I agree that having a good export/import system (akin to a SQL dump) would make batch exports, edits, and re-imports a lot easier. You’re not the first customer to have requested that, and I also wish we offered it. Let’s leave the feature request open, because it’s a really solid one (but please remember to vote on it!).
But for now, I’m afraid the API calls are the only way to do this sort of thing. Yes, it’s a huge hassle compared to being able to modify tables directly.
I did want to at least try to address some of the particular pain points, though:
IDs: You should be able to create a record with your own specified ID using the id parameter: https://www.datocms.com/docs/content-management-api/resources/item/create. It’s not super clear in the examples there, but it should take the ID you give it as long as it conforms to our ID formats (either an int, for older projects, or a Dato-specific UUID).
Inside our cma-client lib, there are two utility functions that might help: isValidId() is a boolean check on whether an ID will be accepted by our systems, and generateId() will make a new Dato-compatible ID. But if you just re-use the IDs from the other environment, they should already be valid.
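Roughly, that could look something like this (an untested sketch, assuming the @datocms/cma-client-node / @datocms/cma-client packages; the environment name, model ID and field names are placeholders):

```ts
// Rough sketch: re-create a record in the new environment while keeping its ID.
// Model and field names are placeholders for whatever your schema uses.
import { buildClient } from '@datocms/cma-client-node';
import { generateId, isValidId } from '@datocms/cma-client';

const client = buildClient({
  apiToken: process.env.DATO_API_TOKEN!,
  environment: 'migration-sandbox', // hypothetical forked environment
});

async function recreateRecord(
  oldRecord: { id: string; title: string },
  newItemTypeId: string,
) {
  // Re-use the old ID if the API will accept it, otherwise mint a fresh one.
  const id = isValidId(oldRecord.id) ? oldRecord.id : generateId();

  return client.items.create({
    id,
    item_type: { type: 'item_type', id: newItemTypeId },
    title: oldRecord.title, // ...plus the rest of the migrated fields
  });
}
```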
About the updated time: yes, unfortunately that’s not editable. But could you make a separate migrated_timestamp datetime field that just holds the old record’s updated datetime, and have the frontend or GraphQL API sort on that field when it’s available? (It’s basically your old_id idea, except for the real last-updated time.)
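On the frontend side, it could then be a tiny sort helper that prefers the migrated value and falls back to Dato’s own timestamp, something along these lines (field names here are hypothetical):

```ts
// Sketch: order records by the migrated timestamp when present, otherwise by
// Dato's own _updatedAt. Field names are hypothetical.
type DatedRecord = {
  migratedTimestamp?: string | null; // copied from the old record's meta.updated_at
  _updatedAt: string;
};

function sortByEffectiveUpdatedAt<T extends DatedRecord>(records: T[]): T[] {
  const effective = (r: DatedRecord) => r.migratedTimestamp ?? r._updatedAt;
  // Newest first
  return [...records].sort(
    (a, b) => Date.parse(effective(b)) - Date.parse(effective(a)),
  );
}
```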
As for the general slowness (with 30,000 records), are you using any sort of parallel/concurrent requests right now? The CMA should allow 60 requests / 3 seconds (or, for simplicity, call it 20 requests per second). If you fully saturate that limit, it should only take about 25 minutes to send 30,000 separate requests, even in the worst case. Libs like p-throttle make that pretty easy.
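For example (just a sketch; the token handling, item type and update payload are placeholders):

```ts
// Sketch: run the per-record updates concurrently while respecting the
// 60 requests / 3 seconds limit, using p-throttle to queue the calls.
import pThrottle from 'p-throttle';
import { buildClient } from '@datocms/cma-client-node';

const client = buildClient({ apiToken: process.env.DATO_API_TOKEN! });

// At most 60 calls per rolling 3-second window.
const throttle = pThrottle({ limit: 60, interval: 3000 });

const updateRecord = throttle((id: string) =>
  client.items.update(id, { some_field: 'new value' }), // placeholder payload
);

async function migrateAll(recordIds: string[]) {
  // All 30,000 calls get queued immediately; p-throttle drip-feeds them out.
  await Promise.all(recordIds.map((id) => updateRecord(id)));
}
```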
None of this is meant to dismiss your underlying concern (that it’s hard to bulk-edit tens of thousands of records and keep schemas in sync). I really am sorry that it’s difficult. I’m just hoping the above can make it a little less painful for the time being, but let’s also definitely leave the feature request open and hope that the devs take note!
I have just discovered that I had been doing the migrated ID wrong and that it is possible, which is helping a lot.
For the updated time, I think I’m just going to republish each record in sequence to preserve the order, to be honest. It’s not a perfect solution, but it’s also not a big problem. As you say, the main benefit would come from a more direct way of editing the entire dataset at once.
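Roughly, I’m planning a loop along these lines (just a sketch for now; I still need to handle drafts and errors properly):

```ts
// Sketch: walk every record in a model, sort by the real updated time, and
// publish them oldest-first so the relative order is preserved.
import { buildClient } from '@datocms/cma-client-node';

const client = buildClient({ apiToken: process.env.DATO_API_TOKEN! });

async function republishInOrder(itemTypeId: string) {
  const records: { id: string; meta: { updated_at: string } }[] = [];

  for await (const record of client.items.listPagedIterator({
    filter: { type: itemTypeId },
  })) {
    records.push(record);
  }

  // Oldest first, so the most recently edited record ends up published last.
  records.sort(
    (a, b) => Date.parse(a.meta.updated_at) - Date.parse(b.meta.updated_at),
  );

  for (const record of records) {
    await client.items.publish(record.id); // one at a time to keep the order stable
  }
}
```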
@roger
Having investigated this further, I have discovered another dealbreaker. Our records go back to before the ID format was changed, so I cannot create new duplicate records with those IDs; the system rejects the format. I also can’t seem to move a record between models with identical fields, which I honestly wasn’t expecting to work anyway, but it would have made a fork → migrate pattern viable. I think that’s the end of my options, and I’m going to have to implement an “oldId” field on all new records and write the logic for handling both IDs into our front-end and reporting system. Not ideal.
I understand why offering low-level DB access might be difficult from a validation perspective, but a more bulk-friendly endpoint for migrations, one that allows closer metadata importing, would really make this more viable for us at scale.
Thanks for your help, I hope this gets considered.
I chatted with the devs on this, and unfortunately there is no way to designate an integer ID for a new record. The ability to specify your own ID was actually only launched with the new UUIDs, so it was apparently never possible to choose your own integer IDs.
I know that this is a lot of work (and let’s definitely leave the feature request here, because it’s a common enough request and hopefully will get more votes).
One last thing to consider, though: If it’s at all practical, it might be easier to do a one-time migration on your existing project (into a new sandbox environment or such) to re-create all your existing records into UUIDs. If you’re going to have to write a giant migration script anyway, adding that step might make ongoing sync easier, at the expense of having to adapt/test your frontend against the new IDs.
It’s still a hassle, but hopefully then it’s just a one-time hassle vs the ongoing headache of having to manually dedupe two IDs for each record…
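If it helps, that re-keying pass could be roughly along these lines (a sketch only; old_id is a hypothetical field you’d add to the new models, and the returned map is what you’d use to update the external tables that store record IDs):

```ts
// Rough sketch of a one-time re-keying pass: give each legacy integer-ID record
// a fresh UUID, keep the old ID in a (hypothetical) old_id field, and return an
// old -> new mapping for updating the external tables that reference record IDs.
import { buildClient } from '@datocms/cma-client-node';
import { generateId } from '@datocms/cma-client';

const client = buildClient({
  apiToken: process.env.DATO_API_TOKEN!,
  environment: 'uuid-migration', // hypothetical sandbox environment
});

async function rekeyRecords(
  oldRecords: Array<{ id: string; title: string }>, // placeholder field shape
  newItemTypeId: string,
) {
  const idMap: Record<string, string> = {}; // old integer ID -> new UUID

  for (const oldRecord of oldRecords) {
    const newId = generateId();

    await client.items.create({
      id: newId,
      item_type: { type: 'item_type', id: newItemTypeId },
      title: oldRecord.title, // ...plus the rest of the migrated fields
      old_id: oldRecord.id, // hypothetical field kept for the BI/reporting join
    });

    idMap[oldRecord.id] = newId;
  }

  return idMap;
}
```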