Formats and Conversions

Old Google Sites to DokuWiki

Export the old Google Sites site as HTML using Google Takeout.
Extract the content with a script (Python/BeautifulSoup), discarding the header and sidebar.
Use pandoc to convert the content from HTML to Markdown (or DokuWiki, etc.).
Repeat this for each file.

Although this does not preserve links, it can automate much of the work.
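
Because links are not preserved, it can help to list the links each exported page contains before converting it, so they can be fixed by hand afterwards. The following is only a minimal sketch and not part of the original script: it uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra packages, and the names LinkCollector and list_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def list_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Made-up sample fragment, shaped like the extracted page content.
    sample = (
        '<div dir="ltr"><a href="fisica-cpu.html">CPU</a> '
        '<a href="https://pandoc.org">pandoc</a></div>'
    )
    for link in list_links(sample):
        print(link)
```

Running list_links on each file in ./tmp would give a checklist of links to review in the converted pages.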

Details

Adapt and run the script:

import glob
import os
import subprocess
import urllib.request  # only needed for the commented-out alternative below

import bs4

# Iterate over every HTML file in the current folder and generate
# DokuWiki files in ./clean (and intermediate files in ./tmp).
os.makedirs("tmp", exist_ok=True)
os.makedirs("clean", exist_ok=True)

for path in glob.glob("*.html"):

    # path = "fisica-cpu.html"
    try:
        with open(path, "r") as html_file:
            html = html_file.read()

        # html = urllib.request.urlopen(path).read()

        soup = bs4.BeautifulSoup(html, "html.parser")

        # Keep only the page body; this drops the header and sidebar.
        # Alternative: div = soup.find("div", {"dir": "ltr"})
        div = soup.find("div", {"id": "sites-canvas-main-content"})
        div = div.find("div", {"dir": "ltr"})

        content = str(div)

        print(path + ": " + content[:50])

        tmp_path = "tmp/" + path
        with open(tmp_path, "w") as tmp_file:
            tmp_file.write(content)

        # HTML -> Markdown (raw HTML disabled, no hard line wrapping).
        # Passing a list avoids shell quoting problems with odd file names.
        subprocess.run(["pandoc", tmp_path, "--from", "html", "--to", "markdown_github-raw_html", "-o", tmp_path + ".md", "--wrap=none"], check=True)

        # Markdown -> DokuWiki.
        clean_path = "clean/" + path
        subprocess.run(["pandoc", tmp_path + ".md", "--from", "markdown_github-raw_html", "--to", "dokuwiki", "-o", clean_path + ".txt"], check=True)

    except Exception as e:
        print("\nerror while processing: " + path + "\n" + str(e) + "\n")

Note: this requires Beautiful Soup, which can be installed with: pip3 install beautifulsoup4.
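
Before running the script on a whole export, it may be worth checking that both dependencies are available. This is a small sketch using only the standard library; the function name missing_dependencies is hypothetical, and it assumes pandoc is installed as a program named pandoc on the PATH.

```python
import importlib.util
import shutil


def missing_dependencies():
    """Return the names of the tools the script needs but cannot find."""
    missing = []
    # Beautiful Soup is imported as the "bs4" module.
    if importlib.util.find_spec("bs4") is None:
        missing.append("beautifulsoup4")
    # pandoc is called as an external program, so it must be on the PATH.
    if shutil.which("pandoc") is None:
        missing.append("pandoc")
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all dependencies found")
```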

Convert with pandoc:

cat clean.content.html | pandoc --from html --to markdown_github-raw_html -o fisica-cpu.txt --wrap=none

Instead of markdown_github-raw_html you can use markdown_strict or dokuwiki.
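
To try the different output formats from Python, the command above can be built as an argument list. This is a hedged sketch: pandoc_command and FORMATS are hypothetical names, the file names are the ones used in the command above, and the conversion only runs if pandoc and the input file are actually present.

```python
import os
import shutil
import subprocess

# Output formats mentioned in the text above.
FORMATS = ("markdown_github-raw_html", "markdown_strict", "dokuwiki")


def pandoc_command(source, target, to_format):
    """Build the pandoc argument list for converting an HTML file."""
    if to_format not in FORMATS:
        raise ValueError("unexpected format: " + to_format)
    return ["pandoc", source, "--from", "html", "--to", to_format,
            "-o", target, "--wrap=none"]


if __name__ == "__main__":
    command = pandoc_command("clean.content.html", "fisica-cpu.txt", "dokuwiki")
    print(" ".join(command))
    # Only run the conversion when pandoc and the input file exist.
    if shutil.which("pandoc") and os.path.exists("clean.content.html"):
        subprocess.run(command, check=True)
```

Restricting to_format to a known tuple is just a guard against typos; any other format that pandoc supports could be added to FORMATS.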