The personal website of Scott W Harden
⚠️ Warning: This article is obsolete.
Articles typically receive this designation when the technology they describe is no longer relevant, code provided is later deemed to be of poor quality, or the topics discussed are better presented in future articles. Articles like this are retained for the sake of preservation, but their content should be critically assessed.

Removing Textile Markup From Wordpress Entries

I realized that the C code from yesterday wasn't showing up properly because of Textile, a rapid, inline, tag-based formatting system. While it's fun and convenient to use, it's not always practical. The problem I was having was that in C code, variable names (such as delay) were becoming irrevocably italicized, and nothing I did could make Textile ignore code while still styling text. The kicker is that I couldn't disable it easily, because I've been writing in this style for over four years! I decided that the time was now to put my mad Python skills to the test and write code to handle the conversion from Textile format to raw HTML. I accomplished this feat in a number of steps. Yeah, I could have done hours of research to find a "faster way", but it simply wouldn't have been as creative.

In a nutshell, I backed up the SQL database using phpMyAdmin to a single "x.sql" file. I then wrote a Python script to parse this [massive] file and output "o.sql", the same data but with all of the Textile tags I commonly used replaced by their HTML equivalents. It's not 100% perfect, but it's 99.999% perfect. I'll accept that. The output? You're viewing it! Here's the code I used to do it:

## This Python script removes *SOME* Textile formatting from WordPress
## backups in plain text SQL format (dumped from phpMyAdmin). Specifically,
## it corrects bold and italic fonts and corrects links. It should be easy
## to expand if you need to do something else with it.

infile = 'x.sql'

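# These are the easy replacements. Line breaks in the dump are assumed to
# appear as the literal escapes \r and \n (a backslash plus a letter),
# hence the doubled backslashes in the patterns.
replacements = [["\\r"," "],["\\n"," \\n "],["*:","* :"],["_:","_ :"],
                ["\\n","<br>\\n"],[">*","> *"],["*< ","* <"],
                [">_","> _"],["_< ","_ <"],
                [" *"," <b>"],["* ","</b> "],[" _"," <i>"],["_ ","</i> "]]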

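def fixLinks(line):
    ## replace "link text":URL with <a href="URL">link text</a>. ##
    words = line.split(" ")
    for i in range(len(words)):
        word = words[i]
        if '":' in word:
            upto=1
            # walk backwards until the opening quote of the link text is found
            while word.count('"')<2 and upto<=i:
                word = words[i-upto]+" "+word
                upto+=1
            if word.count('"')<2: continue #no opening quote, not a link
            word_orig = word
            extra=""
            word = word.split('":',1)
            if len(word[1])==0: continue #nothing after the colon, not a link
            word[0]=word[0][1:] #drop the opening quote
            # keep one trailing punctuation mark out of the URL
            for char in ".),'":
                if word[1][-1]==char: extra=char
            if len(extra)>0: word[1]=word[1][:-1]
            word_new='<a href="%s">%s</a>'%(word[1],word[0])+extra
            line=line.replace(word_orig,word_new)
    return line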

def stripTextile(orig):
    ## Handle the replacements and link fixing for each line. ##
    if orig.count("', '") != 13: return orig #non-normal post
    line = orig.split("', '",5)[2] #third quoted field, the post body
    if len(line)<10: return orig #non-normal post
    origline = line
    line = " "+line
    for replacement in replacements:
        line = line.replace(replacement[0],replacement[1])
    line=fixLinks(line)
    line = orig.replace(origline,line)
    return line

f=open(infile)
raw=f.readlines()
f.close()
posts=0
for raw_i in range(len(raw)):
    if raw[raw_i][:11]=="INSERT INTO":
        if "wp_posts" in raw[raw_i]: #if it's a post, handle it!
            posts+=1
            print("on post %d" % posts)
            raw[raw_i]=stripTextile(raw[raw_i])

print("WRITING...")
f=open('o.sql','w')
f.writelines(raw)
f.close()
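
To see what the replacement table does to a single post body, here is a minimal sketch in Python 3. It only mirrors the idea of the table above: the names `REPLACEMENTS` and `apply_replacements` and the sample strings are hypothetical, and it assumes line breaks show up in the dump as the literal escapes `\r` and `\n`.

```python
# Illustrative sketch only: it mirrors the idea of the script's replacement
# table. The names and sample strings are hypothetical, not from the script.
REPLACEMENTS = [
    ("\\r", " "), ("\\n", " \\n "),   # pad literal line-break escapes with spaces
    ("*:", "* :"), ("_:", "_ :"),     # separate a closing marker from a colon
    ("\\n", "<br>\\n"),               # make line breaks visible in HTML
    (">*", "> *"), ("*< ", "* <"),    # separate markers from adjacent tags
    (">_", "> _"), ("_< ", "_ <"),
    (" *", " <b>"), ("* ", "</b> "),  # space-delimited markers become tags
    (" _", " <i>"), ("_ ", "</i> "),
]

def apply_replacements(text):
    """Apply every (old, new) pair in order to one post body."""
    text = " " + text  # leading space so a marker at the very start is matched
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

print(apply_replacements("This is *bold* and _italic_ text."))
# → " This is <b>bold</b> and <i>italic</i> text."

print(apply_replacements("uint8_t my_delay = 5;"))
# → " uint8_t my_delay = 5;"  (underscores inside identifiers are left alone)
```

Because the markers only match when a space sits next to them, underscores inside identifiers survive. A lone `*` with spaces on both sides (as in C multiplication) would still be turned into a tag, which is one way a table like this ends up short of 100% perfect.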

I certainly held my breath while the thing ran. As I previously mentioned, this script rewrites a dump of the site's SQL tables, so every time I uploaded a "corrected" version that still had a bug in it, I broke the site. That kept happening until I got all the bugs worked out. Here's an image from earlier today when my site was totally dead (0 blog posts).
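
A quick sanity check before uploading might catch some of this kind of breakage. The sketch below is not part of the original script. It assumes both dumps have been read into lists of lines, and the name `check_dump` is hypothetical. It only checks that the line count is unchanged, that non-INSERT lines are untouched, and that each INSERT line still has the same number of `', '` field separators.

```python
def check_dump(original_lines, converted_lines, separator="', '"):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(original_lines) != len(converted_lines):
        problems.append("line count changed: %d -> %d"
                        % (len(original_lines), len(converted_lines)))
        return problems
    pairs = zip(original_lines, converted_lines)
    for number, (before, after) in enumerate(pairs, start=1):
        if not before.startswith("INSERT INTO"):
            # Only INSERT lines should ever be rewritten.
            if before != after:
                problems.append("line %d: non-INSERT line was modified" % number)
        elif before.count(separator) != after.count(separator):
            # A changed separator count suggests a field boundary was damaged.
            problems.append("line %d: field separator count changed" % number)
    return problems

# Hypothetical one-line dumps, before and after conversion.
before = ["INSERT INTO wp_posts VALUES (1, 'a', 'some *bold* text', 'z');\n"]
after = ["INSERT INTO wp_posts VALUES (1, 'a', ' some <b>bold</b> text', 'z');\n"]
print(check_dump(before, after))  # → []
```

Passing these checks does not prove the dump will import cleanly. It only rules out a few structural mistakes.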
