Potential bug in datocms-html-to-structured-text npm package

If this is the wrong place to file this problem/bug, I can move it to the appropriate place.

Issue(s)

  1. When converting a block of HTML that contains a <del> tag into structured text, the resulting structured text drops the <del> (the text is kept, but no mark is applied)
  2. The same thing happens with the <sup> tag

Oddly enough, if I pass in HTML with an <s> tag, the structured text has a "marks":["strikethrough"] in the appropriate place. If I take that structured text and pass it through the datocms-structured-text-to-html-string library, it returns HTML that has a <del> tag instead of <s>. (Which seems legit to me.)

Example code

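import { parse } from 'parse5';
import { parse5ToStructuredText } from 'datocms-html-to-structured-text';

const html = 'some block of html';

// Parse the HTML string into a parse5 document, keeping source locations
const dom = parse(html, {
  sourceCodeLocationInfo: true,
});
// Convert the parse5 document into structured text (dast)
const result = await parse5ToStructuredText(dom);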

Steps to repeat del issue

  1. Pass the following HTML to parse5ToStructuredText function: Hello world, this should be <del>striked</del>.
  2. Function returns: {"dast":{"schema":"dast","document":{"children":[{"type":"paragraph","children":[{"value":"Hello world, this should be ","type":"span"},{"value":"striked","type":"span"},{"value":".","type":"span"}]}],"type":"root"}}}

Steps to repeat sup issue

  1. Pass the following HTML to parse5ToStructuredText function: Hello world, this should be <sup>sup</sup>.
  2. Function returns: {"dast":{"schema":"dast","document":{"children":[{"type":"paragraph","children":[{"value":"Hello world, this should be ","type":"span"},{"value":"sup","type":"span"},{"value":".","type":"span"}]}],"type":"root"}}}
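
To confirm the loss programmatically rather than by reading the JSON, here is a minimal sketch that collects every mark found on span nodes in a result. `collectMarks` is a hypothetical helper, not part of either DatoCMS package; it only assumes the node shapes shown in the outputs above (spans with an optional "marks" array).

```javascript
// Hypothetical helper: recursively collect every mark on span nodes
// anywhere inside a value (object, array, or primitive).
function collectMarks(value, found = new Set()) {
  if (Array.isArray(value)) {
    value.forEach((item) => collectMarks(item, found));
  } else if (value && typeof value === 'object') {
    if (value.type === 'span' && Array.isArray(value.marks)) {
      value.marks.forEach((mark) => found.add(mark));
    }
    // Descend into nested properties such as "document" and "children".
    Object.values(value).forEach((child) => collectMarks(child, found));
  }
  return found;
}

// The <del> output from the steps above: no span carries any marks.
const delResult = {
  dast: {
    schema: 'dast',
    document: {
      type: 'root',
      children: [
        {
          type: 'paragraph',
          children: [
            { type: 'span', value: 'Hello world, this should be ' },
            { type: 'span', value: 'striked' },
            { type: 'span', value: '.' },
          ],
        },
      ],
    },
  },
};

console.log(collectMarks(delResult).size === 0); // true: the strikethrough was lost
```

Running the same check on the result for an <s> tag should report "strikethrough", which makes the difference between the two inputs easy to assert in a sync script.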

Versions

"datocms-html-to-structured-text": "^3.0.0",
 "datocms-structured-text-to-html-string": "^3.0.0",

Hey @jearle,

Thanks for the detailed report!

Structured Text isn’t meant to have exhaustive support for every possible HTML tag, only some of the basic/semantic ones. You can see its supported nodes/marks here: Structured Text and Dast format — DatoCMS

In particular, <del> and <sup> aren’t part of its supported nodes/marks by default, so there’s nothing for the converter to parse them into.

You can write your own handlers for these if you’d like: structured-text/packages/html-to-structured-text at readme-link-fix · datocms/structured-text · GitHub, but you’d also have to write your own renderers on the frontend: react-datocms/docs/structured-text.md at master · datocms/react-datocms · GitHub

If that’s too much work and you don’t need the full structure of Structured Text, it might be simpler to use an HTML field instead (it’s one of the multi-line string field presentation options… we include a WYSIWYG editor by default or you can supply your own using a field editor extension plugin, like this TinyMCE plugin).

I can understand not supporting <sup>, but I thought I’d ask about it anyway.

But what I don’t understand is that there appears to be some sort of support for <del>, as there is a mark called “strikethrough”, and when I render structured text with a mark of “strikethrough”, it puts a <del> tag in the rendered HTML. So why would it not support converting <del> into a “strikethrough” mark? Essentially, the HTML your code renders is unsupported by the html-to-structured-text code.

Strikethrough does appear in the codebase as a “defaultMark”: structured-text/packages/utils/src/definitions.ts at main · datocms/structured-text · GitHub

I agree that sounds a little bit strange. Probably the more semantically correct fix would be for the structured text → HTML to replace the output <del> tag with a <s> instead, but that still wouldn’t help you with the other way around (HTML → structured text).

My understanding is that the <del> tag should usually be accompanied with a <ins> tag (HTML del tag), which isn’t quite as simple to deal with. It’s not just a simple substitute for <s>, I don’t think. So our output should probably be <s>, since the Structured Text editor UI only allows for striking through and not the real concepts of deletion/insertion.

How are you using those tags in the source HTML? Are the <del>s accompanied by <ins> too, or can you replace them with <s> before import?
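
If swapping the tags before import is acceptable, a minimal sketch of that preprocessing step could look like this. `replaceDelWithS` is a hypothetical helper name, not part of any DatoCMS package. It uses a plain regex, drops any attributes on the <del> tag (such as cite or datetime), and leaves <ins> untouched:

```javascript
// Hypothetical helper: rewrite <del>...</del> as <s>...</s> in an HTML string
// so that the importer sees a tag it already maps to "strikethrough".
function replaceDelWithS(html) {
  // (\/?)        captures the slash of a closing tag, if present
  // (?:\s[^>]*)? skips any attributes; they are intentionally discarded
  return html.replace(/<(\/?)del(?:\s[^>]*)?>/gi, '<$1s>');
}

const input = 'Hello world, this should be <del>striked</del>.';
console.log(replaceDelWithS(input));
// Hello world, this should be <s>striked</s>.
```

A regex is enough for a simple tag rename like this one, but it does not understand comments or script content, so for messier source HTML it may be safer to rename the nodes in the parsed tree instead.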

So, I guess I should probably explain my use case here and what caused me to report this.

We are transitioning to DatoCMS, but we want to keep our old CMS in sync with DatoCMS until everyone involved is comfortable with making the move. I’m writing code to keep the two systems in sync, and I noticed that an HTML field we have makes use of the <del> tag. I doubt it’s being used as intended, but it caught my eye that wherever <del> is used, it was effectively being removed in the sync between the two systems.

It was decided they want to use structured text on the DatoCMS side for this field, for future ideas that they have on how to make use of the field. So we’re trying to convert our existing data to it. I don’t think anyone expects it to be exact; if a <del> gets converted to an <s>, I don’t think anyone is going to care. But tags that have an effect on the presentation of the content suddenly being removed is not a desired outcome.

I have no problem writing code to accommodate differences in how systems treat the data. But I thought it would be something I would want to know about if it was my codebase. Definitely your choice on whether to fix it or not (or whether you think it’s something that needs to be fixed). I just thought it seemed like a bug to have code that renders something that it is unable to parse.

Thank you for the explanation, @jearle! I’ll bring up the <del> vs <s> situation with the devs and see if they have any thoughts. I can’t promise that we’ll support importing <del> after that… probably depends on what they think of the <ins> situation. But I’ll let you know as soon as I hear back!

For your use case, it might be good to not only handle the <del> tag but also have the importer check for any other unsupported tags?
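
As a rough sketch of such a check, the function below lists the tag names in an HTML string that are not on an allow-list. Everything here is an assumption for illustration: `findUnhandledTags` is a hypothetical helper, and the example allow-list is NOT the library’s actual list of supported tags, so it should be replaced with whatever your importer and handlers really cover.

```javascript
// Hypothetical helper: report tag names present in the HTML that are not
// in the given allow-list, so unhandled tags are flagged instead of
// silently dropped during import.
function findUnhandledTags(html, handledTags) {
  const handled = new Set(handledTags.map((tag) => tag.toLowerCase()));
  const unhandled = new Set();
  // Match opening tags only; closing tags start with "</" and are skipped.
  for (const match of html.matchAll(/<([a-zA-Z][a-zA-Z0-9-]*)/g)) {
    const tag = match[1].toLowerCase();
    if (!handled.has(tag)) unhandled.add(tag);
  }
  return [...unhandled].sort();
}

// Illustrative allow-list only (an assumption, not the library's real list).
const exampleHandledTags = ['p', 'strong', 'em', 's', 'a'];

const sample = '<p>Price: <del>10</del> <s>9</s> x<sup>2</sup></p>';
console.log(findUnhandledTags(sample, exampleHandledTags));
// [ 'del', 'sup' ]
```

The sync job could log or fail on a non-empty result, which turns a silent loss into something visible. Being regex-based, it is only a rough scan and can be fooled by tag-like text inside comments or scripts.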

In general though, I should probably clarify something:

HTML and Structured Text are not really meant to be direct equivalents (that’s why there is an HTML field type), and conversions between them will be lossy more often than not. Structured Text is neither a subset nor a superset of HTML, but its own thing that just happens to share some overlap with simpler HTML constructs like paragraphs, headings, etc. It’s kinda like Markdown in that regard; there’s not necessarily a 1:1 equivalence except in the simplest cases.

There are many situations where going from one to the other will result in lossy conversions. Some random examples:

Not saying this to be pedantic, but to clarify that, by design, it’s not a 100%-fidelity HTML import or export. It’s its own format meant to encapsulate some basic rich text, yes, but also more complex relationships expressed in a CMS that HTML doesn’t readily handle.

It’s expected that different users will write their own handlers to define the imported schema and output behavior that they want (such as importing into custom blocks and out as custom node renderers).

In your case, during this transition period, it might be helpful to have two fields inside each DatoCMS record: one field can be the “work in progress” Structured Text that your input converter creates, and another can be the “backup” raw HTML from your original CMS, just in case there are any other unaccounted-for tags you need to fix in the future.


In any case, thank you again for this clarification, and I’ll let the devs know about the <del> situation!

Ok, thanks. Please let me know if they decide to look into it further and need more information.

I’ve already written some code to get around these problems and will just continue on with that.

@jearle They’ve released v4.0.0 of the npm package, which now renders strikethroughs as an <s> tag instead of <del>: structured-text/packages/generic-html-renderer/src/index.ts at main · datocms/structured-text · GitHub

Unfortunately, this means we still don’t support the direct import of <del> or <ins> tags.

Thank you. That sounds like a good solution to me.
