ColdFusion Development: Populating a Verity Index From Your Site Map

Posted At : November 30, 2007 2:22 PM | Posted By : Scott Bennett
Related Categories: ColdFusion

One of the most important things when developing a website is making sure that people can easily find the information they need. Site maps and site searches are probably the most commonly implemented features for making a site's content accessible. Whenever I build a site that is more than just a few pages, I usually create a site map that dynamically generates links to every page on the site. Then I use the script below, which reads the site map, crawls the whole site, and indexes the content into a Verity collection to power my search functionality.

indexSite.cfm


<cfscript>
function RemoveHTML(source){

   // Remove HTML Development formatting
   // Replace line breaks with space
   var result = Replace(source,chr(13), " ","ALL");

   // Remove repeating spaces because browsers ignore them
   result = ReReplace(result, "( )+", " ","ALL");

   // Remove the header (prepare first by clearing attributes).
   // REReplaceNoCase also catches upper-case tags, and the non-greedy
   // .*? stops at the first closing tag instead of the last one.
   result = REReplaceNoCase(result, "<( )*head([^>])*>","<head>", "ALL");
   result = REReplaceNoCase(result, "(<( )*(/)( )*head( )*>)","</head>", "ALL");
   result = REReplaceNoCase(result, "(<head>).*?(</head>)","", "ALL");

   // remove all scripts (prepare first by clearing attributes)
   result = REReplaceNoCase(result, "<( )*script([^>])*>","<script>", "ALL");
   result = REReplaceNoCase(result, "(<( )*(/)( )*script( )*>)","</script>", "ALL");
   result = REReplaceNoCase(result, "(<script>).*?(</script>)","", "ALL");

   // remove all styles (prepare first by clearing attributes)
   result = REReplaceNoCase(result, "<( )*style([^>])*>","<style>", "ALL");
   result = REReplaceNoCase(result, "(<( )*(/)( )*style( )*>)","</style>", "ALL");
   result = REReplaceNoCase(result, "(<style>).*?(</style>)","", "ALL");

   // insert spacing in place of <td> tags
   result = REReplaceNoCase(result, "<( )*td([^>])*>","   ", "ALL");

   // insert line breaks in place of <BR> and <LI> tags
   result = REReplaceNoCase(result, "<( )*br( )*>",chr(13), "ALL");
   result = REReplaceNoCase(result, "<( )*li( )*>",chr(13), "ALL");

   // insert a line break in place of
   // <P>, <DIV> and <TR> tags
   result = REReplaceNoCase(result, "<( )*div([^>])*>",chr(13), "ALL");
   result = REReplaceNoCase(result, "<( )*tr([^>])*>",chr(13), "ALL");
   result = REReplaceNoCase(result, "<( )*p([^>])*>",chr(13), "ALL");

   // Remove remaining tags like <a>, links, images,
   // comments etc - anything that's enclosed inside < >
   result = ReReplace(result, "<[^>]*>","", "ALL");

   // replace special characters:
   result = ReReplace(result, "&nbsp;"," ", "ALL");
   result = ReReplace(result, "&bull;"," * ", "ALL");
   result = ReReplace(result, "&lsaquo;","<", "ALL");
   result = ReReplace(result, "&rsaquo;",">", "ALL");
   result = ReReplace(result, "&trade;","(tm)", "ALL");
   result = ReReplace(result, "&frasl;","/", "ALL");
   result = ReReplace(result, "&lt;","<", "ALL");
   result = ReReplace(result, "&gt;",">", "ALL");
   result = ReReplace(result, "&copy;","(c)", "ALL");
   result = ReReplace(result, "&reg;","(r)", "ALL");

   // Remove all others. More special character conversions
   // can be added above if needed
   result = ReReplace(result, "&(.{2,6});", "", "ALL");

   // That's it.
   return result;

}
</cfscript>


<cffunction name="FindURLs" output="true" returntype="array">

<cfargument name="text" type="string" required="yes">

<cfset var results=ArrayNew(1)>
<cfset var pos=1>
<cfset var subex="">
<cfset var done=false>

<cfloop condition="not done">


<cfset subex=reFind("href=""http://(.*?)""", arguments.text, pos, true)>

<cfif subex.len[1] is 0>
<cfset done=true>
<cfelse>

       <cfif not listfind(arraytolist(results),mid(text,subex.pos[1]+6,subex.len[1]-7))>
       <cfset arrayappend(results,mid(text,subex.pos[1]+6,subex.len[1]-7))>
       </cfif>

<cfset pos=subex.pos[1]+subex.len[1]>
</cfif>
</cfloop>


<cfreturn results>
</cffunction>

<cfoutput>


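<!--- Fetch the site map page --->
<cfhttp url="http://www.mywebsite.com/sitemap.cfm" method="GET"></cfhttp>

<!--- Extract every absolute link from the site map --->
<cfset URLArray = FindURLs(cfhttp.FileContent)>

<!--- Build a query that will hold one row per crawled page --->
<cfset SearchData = querynew("title,key,body,custom1,custom2,URLpath")>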


<cfloop from="1" to="#arraylen(URLArray)#" index="i">


<cfif not URLArray[i] contains "checkLogin.cfm">
   <cftry>
   
   <!--- Request the page with search=Y appended to its query string --->
   <cfif URLArray[i] contains "?">
      <cfhttp url="#URLArray[i]#&search=Y" method="GET"></cfhttp>
   <cfelse>
      <cfhttp url="#URLArray[i]#?search=Y" method="GET"></cfhttp>
   </cfif>

   <!--- Pull the page title out of the title tag (case-insensitive) --->
   <cfset startpos = findNoCase("<title>",cfhttp.filecontent,1)>
   <cfset endpos = findNoCase("</title>",cfhttp.filecontent,startpos)>
   <cfset tmpTitle = mid(cfhttp.filecontent,startpos+7,endpos-startpos-7)>

   <!--- Add one row for this page; the URL doubles as the unique key --->
   <cfset queryaddrow(SearchData)>
   <cfset querysetcell(SearchData, "title", "#tmpTitle#")>
   <cfset querysetcell(SearchData, "key", "#URLArray[i]#")>
   <cfset querysetcell(SearchData, "body", "#RemoveHTML(cfhttp.filecontent)#")>
   <cfset querysetcell(SearchData, "custom1", "")>
   <cfset querysetcell(SearchData, "custom2", "")>
   <cfset querysetcell(SearchData, "URLpath", "#URLArray[i]#")>

   
   <cfcatch type="Any">
   #URLArray[i]#
   <cfdump var="#cfcatch#">
   </cfcatch>
   </cftry>
</cfif>
</cfloop>

<cflock name="MyVerityLock" type="EXCLUSIVE" timeout="5">
   <cftry>
      <cfindex action="PURGE" collection="MyCollection">
      <cfindex action="UPDATE" collection="MyCollection" query="SearchData" type="CUSTOM" title="title" body="body" key="key">
   <cfcatch type="Any">
   Indexing Error
   </cfcatch>
   </cftry>
</cflock>
</cfoutput>

You will see in the code that as the script crawls each page of the site, it adds "search=Y" to the URL's query string. I set up my sites so that if URL.Search equals "Y", the pages do not display the site's header, footer, or side navigation. This way my Verity index only contains the content in the body of the page, and Verity searches return more accurate results. However, you do want to make sure that the <title> is still there, as that is used in the collection. Also, you will notice that I am stripping the HTML out of the content before putting it into the body field of my query. This makes Verity index only the actual text on the page; otherwise the collection would index the HTML tags too, and a user searching for "img" would get back every page with an <img> tag.
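
For illustration only, here is a minimal sketch of how a page might honor that flag. The showChrome variable and the placeholder markup are assumptions made up for this example, not taken from my actual templates; the point is simply that the <title> is always output while the header, navigation, and footer are skipped when URL.search is "Y".

```cfm
<!--- Hypothetical page template: showChrome and the placeholder divs are assumptions --->
<cfparam name="URL.search" default="N">
<cfset showChrome = (URL.search neq "Y")>

<html>
<head>
   <!--- Always output the title: the indexing script reads it for the collection --->
   <title>About Us</title>
</head>
<body>
<cfif showChrome>
   <div id="header">Site header goes here</div>
   <div id="sidenav">Side navigation goes here</div>
</cfif>

<!--- Main page content: the only part the crawler should pick up --->
<div id="content">
   <p>Page content goes here.</p>
</div>

<cfif showChrome>
   <div id="footer">Site footer goes here</div>
</cfif>
</body>
</html>
```

On a real site the placeholder divs would normally be includes or a layout wrapper, but the conditional works the same way.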

Also, you will see that I used an exclusive named cflock when updating the collection. I also put a read-only cflock (see code sample below) with the same name around the cfsearch tag on my site's search page. This way people can't search while the collection is being updated, which preserves the integrity of the index. Verity collections can easily get corrupted when you read from and write to them at the same time.

<cftry>
   <cflock name="MyVerityLock" type="READONLY" timeout="1" throwontimeout="Yes">
      <cftry>
         <cfsearch name = "searchResults" collection = "MyCollection" criteria = "#variables.crit#">
         <cfcatch type="Any">
            <b>The search criteria you entered contains invalid characters and/or parameters.</b>
            <cfset searcherror = 1>
         </cfcatch>
      </cftry>
   </cflock>

   <cfcatch type="Lock">
   <b>Our search index is currently being updated. Please try again in a few moments.</b>
   <cfset searcherror = 1>
   </cfcatch>
</cftry>

The script usually needs a little tweaking to tailor it to a particular site. For example, you may have noticed in the code that I had a conditional statement preventing the login page from being indexed. Once you have the script indexing your site the way you want it, you can add a ColdFusion scheduled task to execute the script as often as is necessary for your site.
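
Scheduled tasks can be created in the ColdFusion Administrator, or in code with the cfschedule tag. The snippet below is only a sketch: the task name, start date, start time, interval, and timeout are assumptions to adjust for your own site, and the URL reuses the placeholder domain from the script above.

```cfm
<!--- Run this once (for example from a setup page) to register the task.
      Task name, start date/time, interval, and timeout are assumed values. --->
<cfschedule action="update"
   task="IndexSiteNightly"
   operation="HTTPRequest"
   url="http://www.mywebsite.com/indexSite.cfm"
   startDate="12/01/2007"
   startTime="2:00 AM"
   interval="daily"
   requestTimeOut="600">
```

A generous requestTimeOut is worth considering because the script makes one HTTP request per page in the site map, so larger sites take longer to crawl.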

Comments (6) | 8927 Views


Thanks for this article.

This part especially:
=====
You will see in the code that as the script is crawling each page of the site, it adds "search=Y" to the URL's query string. I set up my sites so that if URL.Search equals "Y", the pages do not display the site's header, footer, or side navigation. This way my Verity index only contains the content in the body of the page.
=====
is SO smart!

I found a similar solution for the indexing via sitemap but ended up jumping through a few hoops to strip out everything before and after the main content area of the page, using a comment in the code. Of course, without that comment I'd be out of luck on any given page. Very cool solution here.

Also appreciate the info about the lock and possible corruption. I will be sure to revisit this code when I do my next verity-via-sitemap setup... soon!

# Posted By Michael Evangelista | 11/30/07 3:05 PM

This code looks great and I would really like to try it out as a standard in my coding for verity search.

What should the sitemap.cfm page contain?

Just a list of links to pages on the site like below?

eg: <a href=index.cfm>home</a>
<a href=index.cfm?pageid=2>about us</a>
<a href=index.cfm?pageid=3>products</a>
<a href=index.cfm?pageid=4>services</a>
<a href=index.cfm?pageid=5>contact us</a>

I look forward to your reply.
Many thanks in advance.

# Posted By Jason | 11/3/08 6:55 AM

@Jason,

This script is set up to read a sitemap where all the href attributes in the links contain full urls.

<a href="http://www.mysite.com/index.cfm">home</a>
<a href="http://www.mysite.com/index.cfm?pageid=2">about us</a>
<a href="http://www.mysite.com/index.cfm?pageid=3">products</a>
<a href="http://www.mysite.com/index.cfm?pageid=4">services</a>
<a href="http://www.mysite.com/index.cfm?pageid=5">contact us</a>

However, in indexSite.cfm you can change the cfhttp tag that reads the site map to set the resolveurl attribute to "yes", and then cfhttp will change all your relative links into full URLs.

<cfhttp url="http://www.mywebsite.com/sitemap.cfm" resolveurl="Yes" method="GET"></cfhttp>

# Posted By Scott Bennett | 11/3/08 1:08 PM

Cool! Thanks Scott.

Hopefully this provides me with an effective solution.
I'll post back and let you know how it works or if I have any other questions.

# Posted By Jason | 11/3/08 8:16 PM

Hey Scott,

When testing the search, it doesn't seem to be searching the contents/body of the pages. It only returns results where the search term used matches what is in the page's <title></title>.

How do I get it to search the body as well?

Also, I cannot display what is stored as "body" and "URLpath".

Sorry I am sounding like such a newbie... this is my first time using Verity, normally I use queries across multiple tables, which is fairly slow.

Thanks in advance for all your help.

# Posted By Jason | 11/4/08 11:49 PM

