A bookmarklet to copy all rendered text content (including hidden) of the current page to clipboard
(Drag this button to your bookmarks toolbar)
(() => {
  const body = document.body.cloneNode(true);
  body.querySelectorAll('*').forEach(e => {
    e.removeAttribute('style');
    e.removeAttribute('class');
  });
  body.querySelectorAll('script,svg,link,style,iframe,noscript').forEach(e => e.remove());
  body.querySelectorAll('select').forEach(e => {
    const d = document.createElement('div');
    d.textContent = e.innerText;
    e.replaceWith(d);
  });
  body.querySelectorAll('img').forEach(e => {
    const d = document.createElement('div');
    d.textContent = e.alt;
    e.replaceWith(d);
  });
  body.querySelectorAll('div, p, section, article, header, footer, aside, main, nav, li, table, tr, td, th, blockquote, h1, h2, h3, h4, h5, h6, pre, select, option, label, textarea')
    .forEach(el => {
      el.insertAdjacentText('beforebegin', ' ');
      el.insertAdjacentText('afterend', ' ');
    });
  navigator.clipboard.writeText(body.innerText.replace(/\s+/g, ' ').trim());
})();
The bookmarklet copies the rendered content of the page. Only the content inside the body tag is copied. All img images are replaced with the content of their alt attributes. The bookmarklet adds spaces between the contents of block-level elements (div, p, etc.).
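Why the extra spaces matter can be shown with a toy example. The sketch below does not use the DOM at all: it strips tags with a regular expression, which is not a real HTML parser, and the names BLOCK_TAGS, naiveText and spacedText are hypothetical, introduced only for this illustration. It assumes that, without a separator, the text of adjacent blocks is simply concatenated.

```javascript
// Toy illustration only: a regex is not a real HTML parser.
// BLOCK_TAGS is a shortened, hypothetical subset of the bookmarklet's selector list.
const BLOCK_TAGS = ['div', 'p', 'li', 'td', 'th', 'h1', 'h2', 'h3'];

// Drop every tag and collapse whitespace, with no separator between blocks.
function naiveText(html) {
  return html.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Replace block-level tags with a space first, then strip the remaining
// (inline) tags and collapse whitespace, as the bookmarklet's last line does.
function spacedText(html) {
  const blockTag = new RegExp(`</?(?:${BLOCK_TAGS.join('|')})\\b[^>]*>`, 'gi');
  return html
    .replace(blockTag, ' ')
    .replace(/<[^>]*>/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

const sample = '<div><h1>Pricing</h1><p>Free <b>plan</b></p></div>';
console.log(naiveText(sample));  // PricingFree plan
console.log(spacedText(sample)); // Pricing Free plan
```

The inline b tag is deliberately left out of BLOCK_TAGS, so words split across inline markup are not broken apart.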
Converting an HTML page to plain text is a surprisingly complex operation with many variables. Fine-tuning the HTML-to-text conversion algorithm can significantly affect the quality of embeddings and TF-IDF results.
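As a rough illustration of that impact, the sketch below counts term frequencies for two plain-text versions of the same hypothetical page fragment, one extracted with spaces between blocks and one without. The termFrequencies helper and both sample strings are assumptions made up for this example, and the tokenizer only handles ASCII letters and digits.

```javascript
// Count how often each lowercase ASCII word occurs in a string.
// A real pipeline would use a proper tokenizer; this is a minimal stand-in.
function termFrequencies(text) {
  const counts = new Map();
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical extraction results for the same fragment of a page.
const withSpaces = 'Pricing Free plan Pro plan';
const withoutSpaces = 'PricingFree planPro plan';

const good = termFrequencies(withSpaces);
const bad = termFrequencies(withoutSpaces);

console.log(good.get('plan'));       // 2
console.log(bad.get('plan'));        // 1
console.log(bad.has('pricingfree')); // true
```

In the second version the term "plan" is undercounted and the vocabulary gains fused tokens such as "pricingfree" that match nothing else in a corpus, which distorts both term counts and any vectors built on top of them.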
Leave a comment here: @ugnich