What's the recommended way to implement a field that only allows lowercase values?

jaakko.jokinen · October 7, 2024, 7:45am

Describe the issue:

I am trying to validate a field to only allow lowercase values. This is to help content maintainers create data that works as expected downstream. The value is an ID of sorts. It may contain uppercase characters in the source system A where the content maintainer copies the value from, but the value in Dato must not contain any uppercase characters so that the ID works properly in system B. We are in an intermediary phase and trying to move away from this pattern, but it will take some time. To make it easier for content maintainers to remember this data requirement, we’d like to implement a validation that only allows lowercase letters.

(Optional) Do you have any sample code you can provide?

  await client.fields.update('model::field', {
    validators: {
      format: {
        custom_pattern: '^(?!.*p{Lu}).*$',
      },
    },
  });

I have attempted to use the Lu Unicode category to create a pattern that disallows all uppercase characters in the alphabets that are relevant to the project. This works in principle, but I don’t think I can make it work with Dato: it appears the regex is tested with JS, in which case I would need to provide the u regex flag to get this match to behave correctly.

Another option could be to list all other uppercase characters that can’t be covered with A-Z, but it seems like an unfortunate amount of special cases to maintain:

^(?!.*[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽ]).*$

It would be nicer if I could offload all of that thinking to unicode categories and the knowledgeable people who have worked to create them.
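To make the trade-off concrete, here is a minimal sketch in plain JavaScript comparing an ASCII-only class with the Lu category. The helper names are hypothetical, and the sketch assumes the pattern is compiled with the u flag, which says nothing about how DatoCMS itself evaluates custom_pattern.

```javascript
// Hypothetical helpers, for illustration only.

// ASCII-only check: misses accented and non-Latin uppercase letters.
const hasAsciiUppercase = (value) => /[A-Z]/.test(value);

// Unicode-aware check: \p{Lu} requires the `u` flag in JavaScript.
const hasAnyUppercase = (value) => /\p{Lu}/u.test(value);

for (const sample of ['abc-123', 'Abc', 'cr\u00C8me', '\u03A9mega']) {
  // Columns: sample, ASCII-only result, Unicode-aware result
  console.log(sample, hasAsciiUppercase(sample), hasAnyUppercase(sample));
}
```

The last two samples contain È (U+00C8) and Ω (U+03A9): the A-Z class lets both through, while the category check flags them without any hand-maintained list.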

The UI will also display this quite complex and technical pattern. It may seem suspicious to people without a technical background.

As there don’t seem to be other questions about this topic, I am wondering whether I’ve missed some more idiomatic way to ensure data quality in a case like this.

roger · October 7, 2024, 6:28pm

Hi @jaakko.jokinen,

This is really interesting. You specifically want a unicode-aware lowercase regex that supports accented lowercase characters too? Hmm… let me look into this for a bit and see if I can come up with anything. Maybe something with computed fields or a similar plug-in with friendlier validation messages?

But in the meantime… are you sure it’s a good idea to use non-ASCII characters in an ID field? I know it’s 2024, and maybe I’m just old-school, but I would be wary of using that as an identifier in any sort of sync with external systems… it wasn’t that long ago that non-ASCII indices would break in all sorts of systems in unexpected ways.

I don’t know your exact use case, but as an alternative, might it be safer to do a “slugification” on that field to both ASCII-ify and deduplicate the original text, and use that derived slug as the ID?

For example:

The original user input in system A is “Crème Brûlée”. An editor pastes that into a regular text field in DatoCMS, like product_name.
Using computed fields or a similar plugin, you slugify that into creme-brulee (validated for uniqueness), put it in another Dato field like internal_id and send that to the external system B.
If an editor attempts to make a “crème brûlée” or “CrÈmE_BrÛlÉE” record, the auto-slugified ID would still be the same, creme-brulee, and cause a validation error.
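The slugification step above could look roughly like the following. This is a sketch only: slugify is a hypothetical helper, not the DatoCMS slug implementation or any plugin's API, and the uniqueness check would still have to happen elsewhere.

```javascript
// Hypothetical slugify helper; not the DatoCMS slug implementation.
function slugify(input) {
  return input
    .normalize('NFKD')            // split accented letters into base letter + combining mark
    .replace(/\p{M}/gu, '')       // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')  // collapse every other run of characters into one hyphen
    .replace(/^-+|-+$/g, '');     // trim leading and trailing hyphens
}

console.log(slugify('Crème Brûlée'));  // creme-brulee
console.log(slugify('CrÈmE_BrÛlÉE'));  // creme-brulee
```

Letters that have no Unicode decomposition (ø or ł, for example) are not mapped to ASCII by this approach and end up as hyphens, so a real implementation would probably need an explicit transliteration table for them.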

Primarily this has the benefit of detaching the human-readable name from the strictness required for machine IDs. Using a plugin would let you more verbosely define the rules for your ID, while also presenting friendlier editor-facing errors that explain what’s going on if they try to duplicate an existing record with disallowed name variations.

Would that work, or do you still need something more like your original request?

roger · October 8, 2024, 9:44pm

@jaakko.jokinen: A bit of good news. Turns out you should be able to use the regex validation after all… I think maybe you just needed the \ escape? And maybe don’t need the lookahead?

^\p{Ll}+$

It seems to work:

Also, if you add help text to the field in its presentation tab, you can tell your editors something more helpful:

roger · October 8, 2024, 11:56pm

Before I realized that regex actually works (whoops), I was also working on a demo plugin for you that lets you use a custom JS function as a validator.

Let me know if that would still help, or if the regex is enough?

jaakko.jokinen · October 10, 2024, 11:41am

Thank you @roger !

I’ll try out the suggestion.

Using an ID like this isn’t an approach I would recommend. One of those things where we are trying to get a better pattern implemented, but aren’t quite there yet. Soon™.

jaakko.jokinen · October 10, 2024, 12:02pm

@roger I am probably missing something here. I can see that you were able to make it work, but when I try the same regex you used, it disallows all input, even entirely lowercase input.

According to MDN, the u (“unicode aware mode”) flag is required for the escape:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode#unicode-aware_mode
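For anyone reproducing the browser-console test, this small sketch shows what the u flag changes in plain JavaScript. It only describes the JS engine's behaviour; it is not a statement about how the DatoCMS validator compiles the pattern.

```javascript
// Without the `u` flag, \p is just an escaped literal "p", so the pattern
// matches the text "p{Lu}" rather than a single uppercase letter.
const withoutFlag = new RegExp('^\\p{Lu}$');
const withFlag = new RegExp('^\\p{Lu}$', 'u');

console.log(withoutFlag.test('A'));      // false
console.log(withoutFlag.test('p{Lu}'));  // true
console.log(withFlag.test('A'));         // true
console.log(withFlag.test('p{Lu}'));     // false
```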

I tried to implement the regex again as you suggested (in Dato), but in my tests lowercase values would not be permitted.

Here’s a screenshot that shows a test I ran directly in the browser console (I assume the validation in the Dato UI happens in the browser):

I am using Chrome 129. Also tested with FF 115.

roger · October 10, 2024, 3:14pm

@jaakko.jokinen Can you please take a few screenshots or recordings to show where it’s going on? (Or DM me a link to the particular record where you’re trying this on?)

It should be a server-side validation:

If you open the network console, you can see that happen onBlur() (when you navigate away from the text field).

Maybe the confusion is that if you don’t click away from the field, it’s not doing the validation as you type, so it seems invalid still, but actually it’s just waiting for the input to lose focus before retrying validation…?

If that’s what’s causing it, I’ll report it as a UX issue. Otherwise, can you please provide more details about how it’s failing?

jaakko.jokinen · October 11, 2024, 12:37pm

Ah, I spotted the issue while making the recording for you.

The original regex disallowed all uppercase characters. The adjusted regex allows all lowercase letters. It doesn’t allow special characters such as spaces. I just skimmed over that change earlier.
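The difference between the two patterns can be checked side by side. This is a sketch in plain JavaScript with the u flag, which is an assumption for local testing rather than a description of the Dato validator.

```javascript
// Original idea: reject any value that contains an uppercase letter.
const noUppercase = /^(?!.*\p{Lu}).*$/u;
// Adjusted pattern: accept only values made up entirely of lowercase letters.
const onlyLowercase = /^\p{Ll}+$/u;

for (const value of ['abc', 'abc def', 'abc-123', 'Abc']) {
  // Columns: value, noUppercase result, onlyLowercase result
  console.log(JSON.stringify(value), noUppercase.test(value), onlyLowercase.test(value));
}
```

Both patterns accept abc and reject Abc, but only the first accepts values such as abc def or abc-123, because spaces, digits, and hyphens are neither uppercase nor lowercase letters.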

So in essence the ID has even more dirty tricks up its sleeve than you expected!

I should have checked the validation routine more carefully. My fumble. Thanks for your help!

E: An additional issue here was that our linter “fixed” the “unnecessary” escape character in our migration file where we added the validation. We ended up having a broken version of the regex pushed to Dato, which was what we ended up testing.

E2: I need to get back to this as there’s something else in the tooling that dislikes the escape character. I can set the regex correctly through the UI and get intended results for validation. However, when I push the change as a migration, the escape is lost somewhere along the way.

export default async function migration(client: Client): Promise<void> {
  await client.fields.update('model::field', {
    hint: 'Value must be lowercase',
    validators: {
      required: {},
      unique: {},
      format: {
        custom_pattern: '^[^\p{Lu}]*$',
      },
    },
  });
}

I run

pnpm dlx @datocms/cli@2 migrations:run --destination=environment-name --profile=default --log-level=BODY

I see

[15] PUT https://site-api.datocms.com/fields/model::field
[15] {
  "data": {
    "id": "model::field",
    "type": "field",
    "attributes": {
      "hint": "Value must be lowercase",
      "validators": {
        "required": {},
        "unique": {},
        "format": {
          "custom_pattern": "^[^p{Lu}]*$"
        }
      }
    }
  }
}

pnpm dlx @datocms/cli@2 --version
@datocms/cli/2.0.14 darwin-x64 node-v20.13.1

Couldn’t determine the place where the escape is lost.

jaakko.jokinen · October 14, 2024, 6:49am

I was able to narrow down the presence of the escape character somewhat:

It’s still present during this line:

https://github.com/datocms/cli/blob/adfe0acbc55e1d21acef9ab63d81c9b10b837295/packages/cli/src/commands/migrations/run.ts#L237

I checked by running migration.toString().

However it’s missing here:

https://github.com/datocms/js-rest-api-clients/blob/e016de74da18c37175fc1e4bb8450c80aff10885/packages/cma-client/src/generated/resources/Field.ts#L76

The body parameter no longer includes the escape.

This is probably expected: some kind of escaping takes place along the way. I’m just not familiar enough with how regex works with JS to be able to pin it down.

Using a double escape seems to work though. Writing ^[^\\p{Lu}]*$ results in ^[^\p{Lu}]*$ being set as the validation in the UI.
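A short sketch of why this happens in plain JavaScript, independent of the CLI: in a normal string literal, \p is not a recognised escape sequence, so the backslash is dropped when the string is evaluated. The function's source text, which is what toString() returns, still shows it. The names below are made up for the illustration.

```javascript
// Hypothetical stand-in for a migration that returns the pattern.
function samplePattern() {
  return '^[^\p{Lu}]*$';
}

// The source text still contains the backslash...
console.log(samplePattern.toString().includes('\\p'));  // true
// ...but the evaluated string does not.
console.log(samplePattern());                           // ^[^p{Lu}]*$

// Doubling the backslash keeps one backslash in the actual value.
const doubleEscape = '^[^\\p{Lu}]*$';
console.log(doubleEscape);                              // ^[^\p{Lu}]*$

// What a JSON request body would contain for each value.
console.log(JSON.stringify({ custom_pattern: samplePattern() }));
// {"custom_pattern":"^[^p{Lu}]*$"}
console.log(JSON.stringify({ custom_pattern: doubleEscape }));
// {"custom_pattern":"^[^\\p{Lu}]*$"}
```

In a logged JSON body, a correctly escaped pattern therefore shows up with two backslashes, since JSON escapes the backslash itself.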

roger · October 14, 2024, 6:57pm

Good catch @jaakko.jokinen! It seems like the \p gets eaten by the string parser. You can double-escape it, as you found, or use String.raw with backticks:

String.raw`^[^\p{Lu}]*$`
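As a quick check that the two spellings are equivalent, the sketch below compares them in plain JavaScript. Compiling the result with the u flag is an assumption made for local testing only.

```javascript
// Two ways to write the same pattern string in source code.
const doubleEscaped = '^[^\\p{Lu}]*$';
const rawPattern = String.raw`^[^\p{Lu}]*$`;

console.log(rawPattern === doubleEscaped);  // true

// Local sanity check of the pattern itself.
const lowercaseOnlyCheck = new RegExp(rawPattern, 'u');
console.log(lowercaseOnlyCheck.test('abc-123'));  // true
console.log(lowercaseOnlyCheck.test('Abc'));      // false
```

String.raw still interpolates `${...}` and cannot contain an unescaped backtick, so patterns that include either of those are easier to write with doubled backslashes.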