Copy the text of the current page to clipboard

A bookmarklet that copies all rendered text content of the current page (including hidden text) to the clipboard

Copy
(Drag this button to your bookmarks toolbar)

Source code

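(() => {
    /* Work on a detached copy so the live page is left untouched. */
    const body = document.body.cloneNode(true);

    /* Drop inline styles and class names from the copy. */
    body.querySelectorAll('*').forEach(e => {
        e.removeAttribute('style');
        e.removeAttribute('class');
    });

    /* Remove scripts, styles, frames and other elements that should not be copied. */
    body.querySelectorAll('script,svg,link,style,iframe,noscript').forEach(e => e.remove());

    /* Replace each select with a div holding its text. */
    body.querySelectorAll('select').forEach(e => {
        const d = document.createElement('div');
        d.textContent = e.innerText;
        e.replaceWith(d);
    });

    /* Replace each image with a div holding its alt text. */
    body.querySelectorAll('img').forEach(e => {
        const d = document.createElement('div');
        d.textContent = e.alt;
        e.replaceWith(d);
    });

    /* Surround block-style elements with spaces so adjacent blocks do not run together. */
    body.querySelectorAll('div, p, section, article, header, footer, aside, main, nav, li, table, tr, td, th, blockquote, h1, h2, h3, h4, h5, h6, pre, select, option, label, textarea')
            .forEach(el => {
                el.insertAdjacentText('beforebegin', ' ');
                el.insertAdjacentText('afterend', ' ');
            });

    /* Collapse all whitespace runs to single spaces and copy the result. */
    navigator.clipboard.writeText(body.innerText.replace(/\s+/g, ' ').trim())
        .catch(err => console.error('Copy failed:', err));
})();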
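
To install the source by hand instead of dragging the button, it has to be turned into a javascript: URL. The following is a minimal sketch of one common way to do that. The helper name toBookmarklet is hypothetical, and this is not necessarily how the button on this page was built.

```javascript
// Hypothetical helper: wrap JavaScript source in a javascript: URL
// that can be saved as the address of a bookmark.
function toBookmarklet(source) {
    // Percent-encode the source so spaces, quotes and newlines survive in a URL.
    return 'javascript:' + encodeURIComponent(source);
}

// Stand-in source for illustration; paste the full source shown above instead.
const source = "(() => { alert('hello'); })();";
const url = toBookmarklet(source);
console.log(url);
```

Create a new bookmark and paste the printed string into its URL field.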

Notes

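The bookmarklet copies the rendered content of the page. Only the content inside the body element is copied. The script, svg, link, style, iframe and noscript elements are removed. Each img element is replaced with the text of its alt attribute. Spaces are inserted around block-style elements (div, p, etc.) so that text from adjacent blocks does not run together, and every run of whitespace is then collapsed to a single space.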
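
To illustrate why the spaces around block elements matter, here is a minimal sketch that works on plain strings rather than on the DOM. The function name normalizeWhitespace and the sample blocks are made up for illustration; only the regular expression is taken from the source code.

```javascript
// Same normalization as the last step of the bookmarklet:
// collapse every whitespace run to one space, then trim.
function normalizeWhitespace(text) {
    return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical text of three adjacent block elements.
const blocks = ['Title', 'First paragraph.', 'Second\n  paragraph.'];

// Without separators, neighbouring blocks fuse into one word.
const fused = normalizeWhitespace(blocks.join(''));
// fused === 'TitleFirst paragraph.Second paragraph.'

// With a space on each side of every block, words stay separate.
const spaced = normalizeWhitespace(blocks.map(b => ' ' + b + ' ').join(''));
// spaced === 'Title First paragraph. Second paragraph.'

console.log(fused);
console.log(spaced);
```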

Converting an HTML page to plain text is a surprisingly complex operation with many variables. Fine-tuning the HTML-to-text conversion algorithm can have a significant impact on downstream results such as embeddings and TF-IDF scores.
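
As a rough illustration of that impact, the sketch below counts terms in two versions of the same hypothetical text, one with a space between blocks and one without. The termFrequencies function and the whitespace tokenizer are simplifying assumptions, not part of the bookmarklet; real pipelines use more elaborate tokenization.

```javascript
// Count how often each whitespace-separated, lowercased token occurs.
function termFrequencies(text) {
    const counts = new Map();
    for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
        counts.set(token, (counts.get(token) || 0) + 1);
    }
    return counts;
}

// The same two table cells, extracted with and without a separating space.
const separated = termFrequencies('Price 10 USD Price 20 USD');
const merged = termFrequencies('Price 10 USDPrice 20 USD');

console.log(separated.get('usd'), separated.get('price')); // 2 2
console.log(merged.get('usd'), merged.get('price'));       // 1 1
console.log(merged.get('usdprice'));                       // 1
```

A single missing space halves the counts of two real terms and introduces a term that never appeared on the page, which is the kind of difference that carries over into TF-IDF weights.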

Feedback

Leave a comment here: @ugnich