Potential bug in datocms-html-to-structured-text npm package

If this is the wrong place to file this problem/bug, I can move it to the appropriate place

Issue(s)

  1. When converting a block of HTML that contains a <del> tag into structured text, the <del> is dropped from the resulting structured text.
  2. The same thing happens with the <sup> tag.

Oddly enough, if I pass in HTML with an <s> tag, the structured text has a "marks":["strikethrough"] in the appropriate place. If I take that structured text and pass it through the datocms-structured-text-to-html-string library, it returns HTML that has a <del> tag instead of <s> (which seems legit to me).

Example code

import { parse } from 'parse5';
import { parse5ToStructuredText } from 'datocms-html-to-structured-text';

const html = "some block of html";

// Parse the HTML string into a parse5 document tree
const dom = parse(html, {
  sourceCodeLocationInfo: true,
});
// Convert the parse5 tree into structured text
const result = await parse5ToStructuredText(dom);

Steps to repeat del issue

  1. Pass the following HTML to the parse5ToStructuredText function: Hello world, this should be <del>striked</del>.
  2. Function returns: {"dast":{"schema":"dast","document":{"children":[{"type":"paragraph","children":[{"value":"Hello world, this should be ","type":"span"},{"value":"striked","type":"span"},{"value":".","type":"span"}]}],"type":"root"}}}

Steps to repeat sup issue

  1. Pass the following HTML to parse5ToStructuredText function: Hello world, this should be <sup>sup</sup>.
  2. Function returns: {"dast":{"schema":"dast","document":{"children":[{"type":"paragraph","children":[{"value":"Hello world, this should be ","type":"span"},{"value":"sup","type":"span"},{"value":".","type":"span"}]}],"type":"root"}}}
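
For what it's worth, here is a minimal sketch of how the missing mark can be confirmed programmatically. collectMarks is a hypothetical helper of my own, not part of either library, and it assumes only the node shape visible in the output above (optional "marks" and "children" arrays).

```javascript
// Hypothetical helper (not part of any DatoCMS package): walks a DAST-like
// tree and collects every mark found on any node.
function collectMarks(node, marks = new Set()) {
  if (Array.isArray(node.marks)) {
    for (const mark of node.marks) marks.add(mark);
  }
  if (Array.isArray(node.children)) {
    for (const child of node.children) collectMarks(child, marks);
  }
  return marks;
}

// The document returned for the <del> input in the steps above.
const delDocument = {
  type: 'root',
  children: [
    {
      type: 'paragraph',
      children: [
        { type: 'span', value: 'Hello world, this should be ' },
        { type: 'span', value: 'striked' },
        { type: 'span', value: '.' },
      ],
    },
  ],
};

// No span carries a mark, so the set is empty.
console.log(collectMarks(delDocument).size); // 0
```

Running the same helper on the output for an <s> input should report "strikethrough", which makes the difference between the two tags easy to spot in a test.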

Versions

"datocms-html-to-structured-text": "^3.0.0",
 "datocms-structured-text-to-html-string": "^3.0.0",

Hey @jearle,

Thanks for the detailed report!

Structured Text isn’t meant to have exhaustive support for every possible HTML tag, only some of the basic/semantic ones. You can see its supported nodes/marks here: Structured Text and Dast format - DatoCMS

In particular, <del> and <sup> aren’t part of its supported nodes/marks by default, so there’s nothing it could parse them to.

You can write your own handlers for these if you’d like: structured-text/packages/html-to-structured-text at readme-link-fix · datocms/structured-text · GitHub, but you’d also have to write your own renderers on the frontend: react-datocms/docs/structured-text.md at master · datocms/react-datocms · GitHub

If that’s too much work and you don’t need the full structure of Structured Text, it might be simpler to use an HTML field instead (it’s one of the multi-line string field presentation options… we include a WYSIWYG editor by default or you can supply your own using a field editor extension plugin, like this TinyMCE plugin).

I can understand not supporting <sup>, but I thought I’d ask about it anyway.

But what I don’t understand is that there appears to be some sort of support for <del>, as there is a mark called “strikethrough”, and when I render structured text with a mark of “strikethrough”, it puts a <del> tag in the rendered HTML. So why would it not support converting <del> into a “strikethrough” mark? Essentially, the HTML your code renders is unsupported by the html-to-structured-text code.

Strikethrough does appear in the codebase as a “defaultMark”: structured-text/packages/utils/src/definitions.ts at main · datocms/structured-text · GitHub

I agree that sounds a little bit strange. Probably the more semantically correct fix would be for the structured text → HTML conversion to output an <s> tag instead of <del>, but that still wouldn’t help you with the other way around (HTML → structured text).

My understanding is that the <del> tag should usually be accompanied by an <ins> tag (HTML del tag), which isn’t quite as simple to deal with. I don’t think it’s just a simple substitute for <s>. So our output should probably be <s>, since the Structured Text editor UI only allows for striking through and not the real concepts of deletion/insertion.

How are you using those tags in the source HTML? Are the <del>s accompanied by <ins> too, or can you replace them with <s> before import?
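
If replacing them is an option, a minimal sketch of that pre-import step could look like the following. replaceDelWithS is a hypothetical helper, not part of any DatoCMS package. It assumes simple markup where attribute values contain no ">" character, and it drops any attributes on <del> (such as cite or datetime), since they do not apply to <s>.

```javascript
// Hypothetical pre-import step: rewrite <del>...</del> as <s>...</s> so the
// importer sees a tag it already maps to the "strikethrough" mark.
// Assumption: attributes on <del> can be discarded and never contain ">".
function replaceDelWithS(html) {
  return html
    .replace(/<del(\s[^>]*)?>/gi, '<s>') // opening tags, with or without attributes
    .replace(/<\/del\s*>/gi, '</s>'); // closing tags
}

const input = 'Hello world, this should be <del>striked</del>.';
console.log(replaceDelWithS(input));
// Hello world, this should be <s>striked</s>.
```

For messier source HTML, doing the same substitution on the parsed tree rather than on the raw string would likely be more robust than a regular expression.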

So, I guess I should probably explain my use case here and what caused me to report this.

We are transitioning to DatoCMS, but we want to keep our old CMS in sync with DatoCMS until everyone involved is comfortable with making the move. I’m writing code to keep the two systems in sync, and I noticed that an HTML field that we have makes use of the <del> tag. I doubt it’s being used as intended, but it caught my eye that in the cases where <del> is used, it was effectively being removed in the sync between the two systems.

It was decided they want to use structured text on the DatoCMS side for this field, for future ideas that they have on how to make use of the field. So we’re trying to convert our existing data to it. I don’t think anyone expects it to be exact; if a <del> gets converted to an <s>, I don’t think anyone is going to care. But tags that have an effect on the presentation of the content suddenly being removed is not a desired outcome.

I have no problem writing code to accommodate differences in how systems treat the data. But I thought it would be something I would want to know about if it was my codebase. Definitely your choice on whether to fix it or not (or whether you think it’s something that needs to be fixed). I just thought it seemed like a bug to have code that renders something that it is unable to parse.

Thank you for the explanation, @jearle! I’ll bring up the <del> vs <s> situation with the devs and see if they have any thoughts. I can’t promise that we’ll support importing <del> after that… probably depends on what they think of the <ins> situation. But I’ll let you know as soon as I hear back!

For your use case, it might be good to not only handle the <del> tag but also have the importer check for any other unsupported tags?
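
A rough sketch of such a check, in case it helps. findUnsupportedTags and the ASSUMED_SUPPORTED_TAGS allowlist are both hypothetical: the list below is only an illustration and should be checked against the handlers that your version of the library actually ships with.

```javascript
// Assumption: illustrative allowlist only, NOT the library's real handler list.
const ASSUMED_SUPPORTED_TAGS = new Set([
  'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'ul', 'ol', 'li', 'blockquote', 'pre', 'code',
  'a', 'strong', 'b', 'em', 'i', 'u', 's', 'strike', 'mark', 'br', 'hr',
]);

// Hypothetical helper: returns the sorted, de-duplicated names of all opening
// tags in `html` that are not in the allowlist.
function findUnsupportedTags(html, supported = ASSUMED_SUPPORTED_TAGS) {
  const unsupported = new Set();
  // Matches "<tagname" of opening tags; closing tags start with "</" and are skipped.
  for (const match of html.matchAll(/<([a-zA-Z][a-zA-Z0-9-]*)/g)) {
    const tag = match[1].toLowerCase();
    if (!supported.has(tag)) unsupported.add(tag);
  }
  return [...unsupported].sort();
}

const sample = '<p>Price: <del>10</del> <s>9</s> m<sup>2</sup></p>';
console.log(findUnsupportedTags(sample)); // [ 'del', 'sup' ]
```

Logging the result per record during the sync would flag content that is about to lose formatting before it is written to DatoCMS.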

In general though, I should probably clarify something:

HTML and Structured Text are not really meant to be direct equivalents (that’s why there is an HTML field type), and conversions between them will be lossy more often than not. Structured Text is neither a subset nor superset of HTML, but its own thing that just happens to share some overlap with simpler HTML constructs like paragraphs, headings, etc. It’s kinda like Markdown in that regard; there’s not necessarily a 1:1 equivalence except in the simplest cases.

There are many situations where going from one to the other will result in lossy conversions. Some random examples:

Not saying this to be pedantic, but to clarify that, by design, it’s not a 100%-fidelity HTML import or export. It’s its own format meant to encapsulate some basic rich text, yes, but also more complex relationships expressed in a CMS that HTML doesn’t readily handle.

It’s expected that different users will write their own handlers to define the imported schema and output behavior that they want (such as importing into custom blocks and out as custom node renderers).

In your case, during this transition period, it might be helpful to have two fields inside each DatoCMS record: one field can be the “work in progress” Structured Text that your input converter creates, and another can be the “backup” raw HTML from your original CMS, just in case there are any other unaccounted-for tags you need to fix in the future.


In any case, thank you again for this clarification, and I’ll let the devs know about the <del> situation!

Ok, thanks. Please let me know if they decide to look into it further and need more information.

I’ve already written some code to get around these problems and will just continue on with that.

@jearle They’ve released v4.0.0 of the npm package, which now renders strikethroughs as an <s> tag instead of <del>: structured-text/packages/generic-html-renderer/src/index.ts at main · datocms/structured-text · GitHub

Unfortunately, this means we still don’t support the direct import of <del> or <ins> tags.

Thank you. That sounds like a good solution to me.
