Uploads via createUpload are crazy slow

After getting nowhere with this topic, as an alternative I tried adding the files to Cloudinary and then using the resulting URL to upload to Dato.

However, while the Cloudinary side of the equation is very fast, the Dato side is crazy slow. I've just run a test using a local Node script, and uploading one 2.2 MB image took just over 27 seconds to return a response!

The URL of the image I used for the test is: https://res.cloudinary.com/dn4jigvha/image/upload/v1710403002/upql03duooybcazkgmys.jpg

That seems ridiculously slow for a file of that size. Is this API meant to be that slow?

In case it's useful, the script is just the basic one from the documentation:

import { buildClient } from "@datocms/cma-client-node";
const client = buildClient({ apiToken: "XXXXXXXXXX" });

async function run() {
  const upload = await client.uploads.createFromUrl({
    // remote URL of the file to upload
    url: "https://res.cloudinary.com/dn4jigvha/image/upload/v1710403002/upql03duooybcazkgmys.jpg",
    // skip the upload and return an existing resource if it's already present in the Media Area:
    skipCreationIfAlreadyExists: false,
    // be notified about the progress of the operation:
    onProgress: handleProgress,
  });
  console.log(upload);
}

run();

function handleProgress(info) {
  console.log("Phase:", info.type);
  console.log("Details:", info.payload);
}

I noticed a similar issue. It seems like Dato has a general-purpose approach to file uploads that handles any file size, but it is super slow for simple image uploads. The fastest I've seen with small images is around 5 seconds, and it can often be in the 10-second range.

Here is a broader feature request around native support for image uploads by URL that discusses this.

@tim1,

I'm so sorry I missed this (and your other thread, which I'll dig back into momentarily!). I was out sick for a bit and they completely fell through the cracks... my apologies there!

To clarify this situation, when you specify a remote URL, we do not actually fetch it from that remote URL directly to our servers. The JS client actually downloads it to a temp file on your own computer, and then re-uploads it to DatoCMS over your connection. So the "transfer" speed will be bottlenecked by your home internet's upload speed. I know this is not ideal :frowning: I was surprised when I first discovered that too.
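Under the hood, it's roughly equivalent to this (a simplified sketch using the same client exports as the longer workaround further down; error handling and options omitted):

import { buildClient, downloadFile, uploadLocalFileAndReturnPath } from "@datocms/cma-client-node";

const client = buildClient({ apiToken: "XXX" });

// 1. The client downloads the remote file to a temp file on YOUR machine
const { filePath } = await downloadFile("https://www.example.com/image.jpg");

// 2. It re-uploads that temp file to DatoCMS-managed storage over YOUR connection
const path = await uploadLocalFileAndReturnPath(client, filePath);

// 3. It then asks our API to create the upload object from that stored file
const upload = await client.uploads.create({ path });
console.log(upload.id);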

I think @nroth's feature request is a good one: Native Upload from Image URL Support

In the meantime, I think the only way you could speed it up is by using some other server (or maybe a serverless function) in the middle, with its own fast internet connection, that will download a remote image, cache it locally (in memory or on disk) and then re-upload it to Dato. Yeah, it's a pain, sorry about that :frowning: I would also like to see a server-to-server fetch where you can just provide a URL and our servers grab it directly. That would be much simpler for all involved. I hope that feature request gets some traction!
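If you do go the middle-server route, the function itself can stay tiny, since createFromUrl() already does the download-then-reupload dance on whatever machine it runs on (a sketch; the handler shape and env var name are placeholders for your platform):

import { buildClient } from "@datocms/cma-client-node";

// Runs on a server/function with a fast connection, so the re-upload leg
// doesn't go over your home connection
export async function uploadViaProxy(remoteUrl: string) {
  const client = buildClient({ apiToken: process.env.DATO_API_TOKEN! });
  return client.uploads.createFromUrl({ url: remoteUrl });
}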

Thanks @roger. Yeah it's a bit of a bugger.

I actually did have this running in a serverless function on Netlify to speed up the connection, but even then images over 1 MB took too long and timed out (there is a limit on how long a function can run on Netlify).

It would be great if you could let me know about the blob thing from the other thread too (e.g. how to create a blob Dato will accept), but I have a feeling even that would be too slow to run in a serverless function, with the analysis happening on Dato's end.

In the meantime I've switched the whole model to use Cloudinary image URLs and am just uploading directly there, which is very rapid and working fine. Just not as nice an experience for the client to manage.

I'll go and upvote that feature request too!

@tim1, I just answered the other thread, but it's still bad news there :frowning: Yes, there was a bug, and even after fixing that bug, it still won't run in a Node environment unless you polyfill XMLHttpRequest.

What is the underlying use case here? Am I understanding correctly that it's going: User → Serverless Func → Cloudinary (and skipping Dato now)?

Is the idea just so that the frontend (i.e. your users) can directly upload images to Dato? If so, maybe this other similar thread can help? Image Upload from user-facing webpage to DatoCMS, without revealing API key (edited title)

In that case, we did it like this:

  • Get permissions to upload a file via an API route or serverless function (so that your end-users don't see your API keys)
  • The user's browser directly sends the file to AWS
  • Another route handler tells DatoCMS to finalize the upload

So there is no long-lived connection here except between the user's browser and AWS directly, which seems to work OK.
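The server side of that flow can be sketched roughly like this (method and field names should be double-checked against the current client docs; I'm going from the REST endpoints here):

import { buildClient } from "@datocms/cma-client-node";

const client = buildClient({ apiToken: process.env.DATO_API_TOKEN! });

// Step 1 (API route): get a pre-signed upload URL so end-users never see the token.
// The "upload request" endpoint returns an `id` (the storage path) and a
// pre-signed `url` to PUT the file bytes to.
export async function getUploadPermission(filename: string) {
  return client.uploadRequest.create({ filename });
}

// Step 2 happens in the user's browser: PUT the file directly to that `url`

// Step 3 (API route): tell DatoCMS to finalize the upload from the storage path
export async function finalizeUpload(path: string) {
  return client.uploads.create({ path });
}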

Would that work for you, or is your use case different enough?

I found that for our use case of reasonably sized images (<1 MB), it isn't downloading the image on the client that is the issue. I monitored the state and timing of the upload process, and it is CREATING_UPLOAD_OBJECT that takes by far the longest for me.
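For reference, here's roughly how I measured it (a minimal sketch; the phase names are just the info.type values the client reports):

import { buildClient } from "@datocms/cma-client-node";

const client = buildClient({ apiToken: "XXX" });

// Log the elapsed time between progress callbacks to see which phase dominates
let last = Date.now();

await client.uploads.createFromUrl({
  url: "https://www.example.com/image.jpg", // any small test image
  onProgress: (info) => {
    const now = Date.now();
    console.log(`${info.type} (+${now - last}ms)`);
    last = now;
  },
});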

This is essentially what I tried to do, but with Cloudinary instead of AWS, in Netlify serverless functions. However, as @nroth pointed out, it's the time to create the upload that is the issue - sometimes taking so long as to exceed the max Netlify function time (26s, I think) for images that were around 5 MB in size.

Apologies for the delayed response here @tim1 and @nroth... I'm trying to see if we can rearrange the forum a bit to prevent posts like this from falling through the cracks :frowning: Sorry about that.


I dug more deeply into this, and TL;DR, the CREATING_UPLOAD_OBJECT step creates an async job on our server that you shouldn't have to wait for, but our client makes you wait :frowning: I'll flag it for the devs and see if it's an easy fix for them...

Good news, though:

In the meantime, here is a workaround that should let you exit early. It reimplements much of the same logic as our real client does (using some of its undocumented exports), but then exits early as soon as CREATING_UPLOAD_OBJECT begins, without waiting for it to finish (which is what was taking so long, but that's just a serverside op that your client shouldn't need to wait for).

That's all your script or serverless func should need to do. Then, a few seconds later, the image will magically appear in your DatoCMS media area... but your function won't have to keep waiting for that to happen (unlike our official client, which keeps polling until it succeeds, which as you saw can take like 30+ seconds for big enough images).

Here's a sample implementation that should exit as soon as the "create" step begins and the job starts, without waiting for it to finish. The image itself won't show up in the media area until about 30-40 secs later, but your client won't have to wait:

import {buildClient, downloadFile, LogLevel, uploadLocalFileAndReturnPath} from "@datocms/cma-client-node";
import {makeCancelablePromise} from '@datocms/rest-client-utils';

let jobUrl: string | undefined;

const logParser = (message: string) => {
    // Exit early if the job URL is already set
    if (jobUrl) return;

    // Otherwise, parse the log messages and get back the job ID
    const jobUrlMatch = message.match(/GET (https:\/\/site-api\.datocms\.com\/job-results\/.+)$/);
    if (jobUrlMatch && jobUrlMatch[1]) {
        jobUrl = jobUrlMatch[1];
    }
}

const client = buildClient({
    apiToken: "YOUR_API_TOKEN",
    logLevel: LogLevel.BODY, // Important, keep this! We have to parse it manually to get your job ID :(
    logFn: logParser, // This is the function that parses the log to get the job ID
});

async function run() {
    const smallFile = "https://upload.wikimedia.org/wikipedia/commons/c/ca/Crater_Lake_from_Watchman_Lookout.jpg"; // unused here; swap in for a quicker test
    const bigFile = 'https://upload.wikimedia.org/wikipedia/commons/7/7d/%22_The_Calutron_Girls%22_Y-12_Oak_Ridge_1944_Large_Format_%2832093954911%29_%282%29.jpg';

    // First download the file locally
    console.log('\nStarting download...')
    const downloadPromise = downloadFile(bigFile, {onProgress: handleProgress})
    const {filePath} = await downloadPromise;
    console.log(`File downloaded to ${filePath}`)

    // Then upload it to S3 and get back a path
    console.log('\nStarting upload...')
    const remotePath = await uploadLocalFileAndReturnPath(client, filePath, {onProgress: handleProgress})
    console.log(`File uploaded to ${remotePath}`)

    // Tell DatoCMS to link the S3 file to your media area
    // Note that we do NOT await it. We will forcibly cancel it later.
    console.log(`\nStarting async job to create file in Dato from your S3 upload...`);
    const asyncCreatePromise = makeCancelablePromise(
        client.uploads.rawCreate(
            {
                data: {
                    type: "upload",
                    attributes: {
                        path: remotePath
                    }
                }
            }
        ));

    console.log('Created the promise, but still waiting for a job URL...');

    (function waitForJobURL() {
        if (jobUrl) {
            console.log(`Found Job URL. Canceling the promise now...`);
            asyncCreatePromise.cancel();
            console.log(`It's safe to exit now. You can ignore the canceled promise error and manually check job status at ${jobUrl}`);
        } else {
            setTimeout(waitForJobURL, 50); // Check again in 50ms
        }
    })();
}

function handleProgress(info: { type: string; payload?: unknown }) {
    console.log("Phase:", info.type);
    console.log("Details:", info.payload);
}

run();

It's really a pretty ugly hack, but it gets the job done. I'll try to get the devs to update the official client too.

And FYI, just to be clear, that script above (and our official client) basically just wraps our HTTP calls: Create a new upload - Upload - Content Management API

Looking at that breakdown, especially step 3, might make it clearer what's actually going on behind the scenes.
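For illustration, step 3 boils down to a single POST (a rough sketch of the raw call; header values per the CMA docs, and remotePath is the storage path from the earlier upload step):

const apiToken = "YOUR_API_TOKEN";
const remotePath = "..."; // storage path returned by the upload step

// Roughly what client.uploads.rawCreate() sends for step 3
const res = await fetch("https://site-api.datocms.com/uploads", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiToken}`,
    Accept: "application/json",
    "Content-Type": "application/vnd.api+json",
    "X-Api-Version": "3",
  },
  body: JSON.stringify({
    data: { type: "upload", attributes: { path: remotePath } },
  }),
});

// The response points at an async job (the /job-results/... URL the log
// parser above was scraping), which the official client polls to completion
console.log(res.status, await res.json());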

I've also filed a ticket with the devs about this, and will let you know if they can update the official client to bypass the wait.

Thanks Roger - appreciated the follow up. Will give that a try.

In my case, unfortunately, this doesn't really help. This would be OK if you were just uploading something to the media area, but I'm trying to get back a reference to the media so I can set it on a record in DatoCMS and our content team can see visual feedback that it was loaded.

Our workflow is that we share exceptional deals we find on sites like amazon.com. So, to simplify the process, our team just drops the Amazon URL into a field in DatoCMS, and we use the Amazon API to retrieve the image URL, price, list price, features, etc.

Given how long it can take for the image to show up end-to-end in the record editor in Dato, the process often takes a lot longer than I'd ever expect for such a tiny image.

The current end-to-end time just seems excessive compared to any other CMS I've tried.

I will specifically ask about this for you: why does that last operation take so long? Do you have an example image/URL of a relatively small image that still takes a long time?

Any Amazon product image would be a good test case. This image is around 57 kB. I see a minimum end-to-end time of 5s, but it's often more like 10s. The image itself takes milliseconds to download on a reasonable connection.

My assumption here is that the way Dato handles the upload job is built to handle just about any file size, but this means it isn't well optimized for basic, small image uploads.

Thank you! I've updated our internal tracking with that information, and will let you know as soon as I hear back.

We had an architectural discussion about this and found out a little more information:

  • The reason it takes so long is that, during that time, our backend is making several other API calls to generate blurhashes, thumbhashes, smart tags, and dominant colors, and to extract EXIF and other metadata.
  • A developer is investigating whether it's possible to postpone some of these tasks until later on, but that may not be an easy thing to do (because it would probably need additional "image created but not all metadata ready yet" states in the API and UI). I wouldn't count on this being added anytime soon.
  • However, we did find a cleaner workaround for the client to exit early (EDIT: fixed some bugs):
// generateId is used below to mint a Dato-compatible ID clientside
import { buildClient, generateId } from "@datocms/cma-client-node";

const client = buildClient({ apiToken: "XXX" });

// Configure the client to fast return with a fake response instead of polling for the real job status
client.jobResultsFetcher = async (jobId) => {
    return {
        id: jobId,
        type: "job_result",
        status: 200,
        payload: {
            "data": {
                id: jobId,
            },
        },
    }
};

// This call will be much faster (no waiting for serverside operations), but the actual
// upload entity will not yet be created. It will actually be created when the job ends

// Optionally, you can generate your own Dato-style UUID client-side to have it immediately available:
const imageId = generateId(); // Must be a valid Dato-style ID
console.log(`Image ID will be ${imageId}`)

const result = await client.uploads.createFromUrl({
    id: imageId, // If you leave this out, the server will generate one for you, but you won't know what it is until the job is done
    url: "https://www.example.com/image.jpg",
});

// The result will simply be the job ID:
console.log(result); // => { id: 'f62d2b3fb428b1ace205b5ad' }

// You can poll for it separately using 
// `await client.jobResults.find('f62d2b3fb428b1ace205b5ad')`

The end effect of this will be similar to my workaround above (you get a job status back immediately, at the risk of your client not waiting for 100% success of the image creation and all metadata being generated, i.e. you'll need some separate, out-of-band check to ensure correctness). But it's a much cleaner/shorter implementation using client.jobResultsFetcher() (which I didn't know about before). If you need that out-of-band check, a minimal poller is sketched below.
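Something like this (a sketch; it assumes jobResults.find() rejects until the job has actually completed, so verify that behavior against your client version):

import { buildClient } from "@datocms/cma-client-node";

const client = buildClient({ apiToken: "XXX" });

// Poll the job on a timer until it finishes, then return the final result
async function waitForJob(jobId: string, intervalMs = 2000) {
  for (;;) {
    try {
      // Assumption: this rejects (404-style) while the job is still running
      return await client.jobResults.find(jobId);
    } catch {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
}

const result = await waitForJob("f62d2b3fb428b1ace205b5ad");
console.log(result);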

Thanks Roger,
Appreciate the follow-up on this.
One question: if you don't specify an ID in the createFromUrl function, will the server still respond with an ID? Or is that ID not created until the image analysis and generation take place?

@tim1, I'm so sorry, that code actually had some bugs. I should've tested it before posting it :frowning: I've updated the post with the correct code and some clarifications:

  1. The job ID is different from the image ID. An image ID uniquely identifies a specific image (or other file), and can be assigned by either the JS client or the server. Separately, a job ID is a server-side reference for an async job in the queue, which in this case is "creating" an image in Dato from an S3 asset; that "creation" step includes all the metadata API calls I mentioned earlier, which is why it's so slow.

  2. In createFromUrl(), if you provide your own Dato-compatible image ID, the uploaded image will use that instead. If you don't provide one, the server will generate it for you, but because we're returning early, you won't know what it is until/unless the job finishes and you get back a full response (i.e. you have to keep polling on your own, outside this script, until the job finishes and returns the image data).

    To be clear, this means it's much faster to generate a clientside Dato-compatible image ID (using generateId() from our lib) than to wait for the server to assign one. A Dato-compatible ID is really just a slightly transformed UUIDv4, so generating them clientside should be fine. That way you have immediate access to the image ID.

  3. Regardless of your image ID, that client.jobResultsFetcher() will immediately return a job ID as soon as it starts. The job ID is completely independent of the image ID, and is just used for keeping track of the image processing progress. Normally our client would keep polling until that processing is done. In this case, we exit early and don't wait at all.

    If you want to, you can separately check that job's status using await client.jobResults.find('jobId') (probably polling for it on a timer, in another script or serverless func). Once it's done, it will respond with both the job ID (which you already know) and either the server-provided image ID or the one you manually assigned using generateId(). A related lookup shortcut is sketched below.
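Since a clientside-generated image ID is known up front, you can also skip job polling entirely and just look the upload up by its ID once the job is done (a sketch; I'm assuming uploads.find() throws a 404-style error until the creation job has finished):

import { buildClient, generateId } from "@datocms/cma-client-node";

const client = buildClient({ apiToken: "XXX" });

// Generate the image ID clientside so it's known before the job completes
const imageId = generateId();

// ...kick off the fast-exit createFromUrl({ id: imageId, url }) as above...

// Later, from any script, fetch the upload record directly by that ID
// (assumption: this 404s until the creation job has finished)
const upload = await client.uploads.find(imageId);
console.log(upload.url);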

@tim1 and @nroth,

I wanted to provide another update: Several developers have discussed this extensively now, and they believe that the fast-exit strategy should suffice for this use case (i.e. using either client.jobResultsFetcher() or the manual workaround above). Either one should get you back the job ID (which you can query separately for success, if you need confirmation of completion, or else just ignore) without a long wait.

As such, they are not currently planning additional improvements to this endpoint/API method. Postponing the metadata generation (the step that makes it take so long, while the client polls and polls for completion) would be a breaking change from what we have now (since "created" images would no longer have the metadata ready), so it's something they are reluctant to do.

What do you think? Will the workarounds suffice, or is there still some issue unaddressed here?