
July 3, 2005

ColdFusion Screen Scraping Blog-Rebuilding Script

Ugh: earlier this week my soon-to-be former host, AffordableHost, somehow torched the MySQL database that held all of my MovableType tables. All that was left was mt_blog, mt_category and mt_author -- the three easiest tables to manually rebuild. Fortunately, with the database gone the blog couldn't be rebuilt, so my static pages were untouched in the /archives directory.

In the interest of my own sanity, I wrote a screen scraping script in ColdFusion to go through each page and dump it to an SQL file with the appropriate INSERTs. If you're interested, I put the two scripts (minus any self-referencing parts) in the extended entry below. Enter your site in the ***YOUR SITE HERE*** area and it will dump your blog into a TEXTAREA with MySQL-friendly syntax. It should properly handle escaping quotes and whatnot. Worked fine for my 150-ish posts, but your mileage may vary.
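As a rough illustration of the quote handling, here is a minimal sketch. It assumes the usual MySQL rule that a single quote inside a single-quoted string literal is written as two single quotes. The name sqlEscape is a hypothetical helper, not part of the scripts below, and it does not deal with backslashes, which MySQL also treats as an escape character by default.

```cfml
<!--- Hypothetical helper: escape a scraped string for use inside a
      single-quoted MySQL string literal. --->
<cffunction name="sqlEscape" returntype="string" output="false">
    <cfargument name="raw" type="string" required="true">
    <!--- Normalize the XML entity to a plain apostrophe first, so the
          doubling step below touches every quote exactly once. --->
    <cfset var cleaned = Replace(arguments.raw, "&apos;", "'", "ALL")>
    <!--- Double every single quote: O'Brien becomes O''Brien. --->
    <cfreturn Replace(cleaned, "'", "''", "ALL")>
</cffunction>

<cfoutput>#sqlEscape("Lee's blog")#</cfoutput>
```

Normalizing before doubling matters: doing the two replacements in the other order would double an already-doubled quote.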

Either way, it's good to be back. I'm with PowWeb now, which has a vastly improved package compared to AffordableHost's (which has really gone downhill since it was purchased by DotCanada earlier this year). Hopefully I'll continue to be happy; the package is pretty killer.

Okay, here's the form page:

<cfsetting enablecfoutputonly="Yes">

<cfif not isdefined("form.sitename")>
<cfoutput>
<form action="" method="post">
Site URL: <input name="sitename" type="text">
<BR>
Subdirectory for archives: /<input name="archives" type="text" value="archives">
<BR><input type="submit">

</form>
</cfoutput>
<cfelse>

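<!--- With enablecfoutputonly on, markup outside cfoutput is dropped, so the line breaks go inside --->
<cfoutput>Generating content for #form.sitename#...<BR><BR></cfoutput>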

<cfset allMyChildren = ArrayNew(3)>
<cfoutput><textarea rows=80 cols=120></cfoutput>
<cfloop from="101" to="185" index="fred">

<!--- Zero-pad the entry ID to six digits (e.g. 101 becomes 000101) --->
<cfset currentFred = NumberFormat(fred, "000000")>

<cfset currentURL = "http://www.***YOUR SITE HERE***.com/archives/" & currentFred & ".html">


<cfinvoke component="rebuildmt" method="getAllFiles" returnvariable="allFiles">
<cfinvokeargument name="mySite" value="#form.sitename#">
<cfinvokeargument name="myDirectory" value="#form.archives#">
<cfinvokeargument name="whichfile" value="#currentURL#">
</cfinvoke>

<cfoutput><cfif allFiles[1] neq "NULL">INSERT INTO mt_entry VALUES (#fred#,1,2,2,0,0,'__default__',NULL,'#allFiles[2]#','','#allFiles[3]#','','',NULL,NULL,NULL,'#allFiles[1]#','#allFiles[1]#',NULL,NULL,'');

</cfif></cfoutput>
</cfloop>
<cfoutput></textarea></cfoutput>

</cfif>

================================================================

And here's the .CFC it calls:

<cfcomponent>
<cffunction name="getAllFiles" access="public" returntype="array">
<cfargument name="mySite" type="string" required="true">
<cfargument name="myDirectory" type="string" required="true">
<cfargument name="whichFile" type="string" required="true">

<cfset myPath = mySite & "/" & myDirectory>


<cfhttp url="#whichFile#" method="get" resolveurl="yes" />

<cfset thisfile = cfhttp.FileContent>
<cfset allMyChildren = ArrayNew(1)>

<cfset headlinestring = "<h3 class=\Stitle\S>.*</h3>">
<cfset headline = ReFindNoCase(headlinestring,thisfile,1,"true")>
<cfif headline.pos[1] neq 0>
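<!--- Skip the 18-character opening tag; 23 = opening tag (18) + closing </h3> (5) --->
<cfset headlineOutput = Mid(thisfile, headline.pos[1] + 18, headline.len[1] - 23)>
<!--- Normalize &apos; to a plain quote first, then double every quote once for SQL --->
<cfset headlineOutput = Replace(headlineOutput, "&apos;", "'", "ALL")>
<cfset headlineOutput = Replace(headlineOutput, "'", "''", "ALL")>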
<cfset allMyChildren[2] = headlineOutput>
<cfelse>
<cfset allMyChildren[2] = "NULL">
</cfif>

<cfset bodystring = "</h3>.*<a name=\Smore\S>">
<cfset body = ReFindNoCase(bodystring,thisfile,1,"true")>
<cfif body.pos[1] neq 0>
<cfset bodyOutput = XMLFormat(Mid(thisfile, body.pos[1] + 7, body.len[1] - 23))>
<!--- XMLFormat turns single quotes into &apos;; convert those to doubled quotes for SQL --->
<cfset bodyOutput = Replace(bodyOutput, "&apos;", "''", "ALL")>
<cfset allMyChildren[3] = bodyOutput>
<cfelse>
<cfset allMyChildren[3] = "NULL">
</cfif>

<cfset datestring = "date=">
<cfset date = ReFindNoCase(datestring,thisfile,1,"true")>
<cfif date.pos[1] neq 0>
<!--- Skip past date= plus the opening quote and take the next 19 characters (the timestamp) --->
<cfset dateOutput = Mid(thisfile, date.pos[1] + 6, date.len[1] + 14)>
<cfset dateOutput = Replace(dateOutput, "T", " ")>
<cfset allMyChildren[1] = dateOutput>
<cfelse>
<cfset allMyChildren[1] = "NULL">
</cfif>


<cfreturn allMyChildren>
</cffunction>
</cfcomponent>
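
================================================================

To sanity-check the component against a single page before running the full loop, something like the following should work. It assumes the component is saved as rebuildmt.cfc in the same directory as the calling page. The example.com URLs are placeholders to swap for one of your own archive pages.

```cfml
<!--- Call getAllFiles for one archive page and inspect the result. --->
<cfinvoke component="rebuildmt" method="getAllFiles" returnvariable="onePage">
    <cfinvokeargument name="mySite" value="http://www.example.com">
    <cfinvokeargument name="myDirectory" value="archives">
    <cfinvokeargument name="whichFile" value="http://www.example.com/archives/000101.html">
</cfinvoke>

<!--- Element 1 is the date, 2 the headline, 3 the body; a value of
      NULL means that pattern was not found on the page. --->
<cfdump var="#onePage#">
```

If any element comes back as NULL for a page you know exists, the regular expressions probably need adjusting to your own template's markup.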
================================================================

Enjoy!

Posted by Lee Clontz at July 3, 2005 2:52 PM
