Miscellaneous Ramblings


I've summarized some cross-browser, performance, and related observations from past and current projects on this page, and am sharing some thoughts on them below. Many of you have no doubt already explored these issues, but perhaps they can save others a few minutes duplicating my experiments.

Note: Many of these topics were first written circa 2005, and are now in 2011 growing a bit "long in the tooth". I'm updating these to reflect HTML5 and CSS3 models as time permits.
 

Viewing 'raw' XML and XSLT with built-in XML Viewers

Discussion: The built-in XML viewers differ notably from one another, and each is somewhat selective about whether and how it will format and display raw XML, particularly depending on whether the XML header declares UTF-8 or UTF-16 (Windows "Unicode") encoding.

Firefox

Firefox is clearly the developers' browser and it will make every attempt to format and display raw xml in its XML Viewer even when there is some conflict between the element data, XML header, and format on disk; and equally, diagnose and report missing closing tags and similar errors. Since I programmatically emit XML, I've found it simplest to use Firefox throughout the development cycle to avoid getting sidetracked by browser quirks. It employs the readily recognizable black  character in place of any conflicted data, consequently it's the only browser I use to check pre-production XML.

Internet Explorer

Internet Explorer has perhaps the most rigid (read: unforgiving) XML viewer from the developer's perspective. Its viewer will abort XML display and report an error on any-and-many conditions. Principal among these is if there is any element content that it considers "inconsistent" or in conflict with the specified encoding (I can take either side of that argument). But said another way, IE does not gracefully handle the entire range of possible characters when XML file encoding is specified as UTF-8 (which is the canonical encoding for XML). This is not entirely a criticism, but the problem nevertheless surfaces with certain scientific notation, Greek character entities, "vulgar fractions", and other content often present in scientific data which has been directly exported from other Microsoft products as XML elements (if not employing CDATA markers -- which is problematic for other reasons and generally to be avoided in XML). In view of this, I've found that using IE to 'final-proof' XML files is helpful, since it will quickly expose (and often diagnose) any compatibility issues. My experience is that "If it works with the IE XML Viewer, it will work with anything".

Opera

Opera is based on Opera Software's proprietary Presto engine. Opera has passed Acid3 testing, though I have admittedly undertaken only basic testing with Opera's built-in XML viewer (version 10.00) -- without any reportable glitches. Opera formats the XML view somewhat differently than IE or Firefox, but it's nevertheless quite readable. I'm generally agnostic with regard to browsers (other than a personal preference for Firefox and IE in that order for reasons noted above), but Opera is fully worth testing with XML since Presto has evolved independently of the other rendering engines.

Safari (and Chrome)

Both Safari and Google Chrome are based on WebKit to render web pages, and neither — surprisingly — provides a built-in XML viewer. Frankly, I find this disappointing since neither displays XML adequately; both simply strip the "unrecognized" element tags, and render the .xml file as one exceedingly long string. If one wishes to view raw XML in either of these browsers in any remotely intelligible way, it's necessary to load the XML, find a whitespace on the page, then right-click, and 'view source'. This is frankly just too primitive to make them useful for viewing XML. Some might argue that plugins are available; I would argue that they shouldn't be required.

Solution:

Proof XML files with Firefox, IE, and Opera -- in that order. The simplest solution for a cross-browser test of XML (albeit not the only one) is to save XML files which contain extended characters with UTF-16 encoding. This works equally well for IE, Firefox, and Opera, and consequently resolves almost all cross-browser issues.

Notes:

1 - To re-save an emitted file as UTF-16 in Windows Vista or 7, open the file with Notepad or WordPad and "Save as" Unicode. Unix offers a wide range of open source tools to accomplish the same result.

2 - In general, if you need to display raw XML within a web page, the simplest method is to invoke a Viewer by embedding an <iframe> in your page with the src=relative_path_to_your.xml. This invokes the browser's built-in XML viewer by default. (And if the file has an associated XSL, it will, further, transform the XML as indicated in the style sheet. You can see an example of this here).

3 - Excel (and essentially all Microsoft products) use UTF-16 encoding internally. If you're emitting XML data from any MS Office Pro product, use:

<?xml version="1.0" encoding="utf-16"?>

for your XML header to avoid cratering the IE Viewer when casually emitted characters cannot be normalized to UTF-8.
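If the XML is emitted programmatically, the re-save step in note 1 can also be scripted. The following is only a sketch for Node.js, assuming the XML is already held in a string; `toUtf16Xml` is a hypothetical helper name, not part of any library. It rewrites the encoding named in the XML header and returns the bytes as UTF-16 (little-endian) preceded by a byte-order mark.

```javascript
// Sketch: re-encode an XML string as UTF-16LE with a byte-order mark (Node.js).
// toUtf16Xml is a hypothetical helper name used only for illustration.
function toUtf16Xml(xmlText) {
  // Make the XML header agree with the bytes that will be written.
  var declared = xmlText.replace(
    /^(<\?xml[^>]*encoding=["'])[^"']*(["'])/i,
    '$1utf-16$2'
  );
  var bom = Buffer.from([0xff, 0xfe]);          // UTF-16 little-endian byte-order mark
  var body = Buffer.from(declared, 'utf16le');  // encode the text as UTF-16LE
  return Buffer.concat([bom, body]);
}

// Example input containing a Greek character.
var bytes = toUtf16Xml('<?xml version="1.0" encoding="utf-8"?><v>\u03b5</v>');
```

The resulting buffer can then be written to disk with Node's `fs.writeFileSync`. Input without an XML header is encoded unchanged.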

'Reset CSS' — Cross-Browser Baseline Compatibility

Most web developers will no doubt have observed — and consequently dealt with, in one way or another — the differences in the default behavior (style properties) of the "A-Grade" browsers and their respective rendering engines:

  • Internet Explorer (Trident)
  • Firefox (Gecko)
  • Safari (WebKit)
  • Chrome (WebKit)
  • Opera (Presto)

One practice used to "level the playing field", that is, to set a baseline from which browser rendering may be uniformly predicted, is to start from a linked stylesheet which resets the default behavior of all browsers to a uniform state. Such a style sheet is generally termed a "reset css". Naturally, there are pros and cons to this approach, and I'll share my parochial thoughts on these here.

A 'reset css' is most commonly the first linked style sheet in a given page <head>, which will then be cascaded or overridden by any following, site-specific style sheets. It contains a very wide range of selectors and default properties to reset all of the rendering engines (browsers) which might be expected to load the page. Some of the more popular Reset CSS stylesheets may be found at:

  • Yahoo Reset Stylesheet (the most concise, minimized reset stylesheet)
  • Eric Meyer's (perhaps the otherwise most prolific) Reset Stylesheet
  • HTML5 Reset Stylesheet

and there are many more to be found with a simple web search. But what are the, admittedly subjective, Pros and Cons?

Pros: A Reset Stylesheet sets a uniform stage for rendering your web pages across the A-Grade Browsers.

Cons: Performance overhead (it requires yet another HTTP request and subsequent parsing overhead).

That being said, I don't typically use a Reset Stylesheet (though I'm not in any way opposed to doing so), for a few reasons:

  • Performance: loading a reset stylesheet typically requires another <link> in the <head> section of every page (or an @import in your base stylesheet), either of which directly translates to one more HTTP request to load another source file from the server and parse it; even if the file is cached, the browser must subsequently iterate through the many selectors to set the reset styles before beginning to load and process the specific styles for the site/page;
  • Redundancy: Since I set the properties for just the subset (i.e., each of the selectors I plan to use on a given website) in my "base" css, loading a reset stylesheet first is redundant browser burden and consumes unnecessary network time and client-side cycles.

Consequently there's no "magic bullet" regarding cross-browser compatibility through reset stylesheets -- they are tremendously apropos in some applications and redundant in others. You'll need to analyze each requirement in order to draw your own conclusions as to which approach is most appropriate in your environment.

Performance Best Practices

Performance Overview: It just plain blows my mind when I go to a 'major' website and, while impatiently waiting for it to render, notice the status bar in my browser indicates "65 items remaining". Geez Louise! guys and gals -- haven't you heard of image sprites, minimization, and other Best Practices?

Page load speed is a crucial consideration when designing and developing web pages, particularly for high volume sites with millions of page views per day. Regardless of the number of servers in a web farm, each visitor experiences a one-on-one browsing experience between their browser and one server for each page viewed. Consequently, it's imperative to take every step possible to develop each page to be responsive — enhancing the browsing experience and encouraging visitors to both stay and return.

There exists a rich body of published Best Practices which address the requirement of reducing page load time from the viewpoints of css, JavaScript, and particularly, images. But the fundamental thread throughout all of these decomposes to three maxims:

  1. reduce the size of downloaded components (minimization)
  2. reduce the number of HTTP requests
  3. load JavaScript at the *end* of the document

Perhaps the most expedient techniques to improve page load speed revolve around images, since images offer rampant opportunities to respond to the first two objectives noted above. Exploiting the significant performance improvement which can be gained from image sprites for both new and existing pages, and optimizing image size for the web, can deliver startling results.

image courtesy of tutorial9.net

Click here for a discussion and examples of using image sprites with HTML 4.01, XHTML 1.0, and HTML5. The topic of Best Practices for page performance is a "work in process" on this site. I'll continue to expand this topic as time presents itself, with discussions of best practices for css and JavaScript.


Escaping quotes in JSTL sql:query with fn:replace

I was recently generating dynamic SQL INSERT and UPDATE queries with JSTL for a back-end MySQL database whose content derived from a dozen text and textarea inputs in an html form. Consequently, the data might (and in fact did) contain numerous embedded apostrophes and both single and double quotes in the content.

I did a quick web surf to find an answer. Although I found a fair amount of discussion, I found little in the way of a cogent solution. The confusion appeared to revolve around the issue of composing the substitution literals for the fn:replace function.

Solution: I tossed the search results and started from basic MySQL principles:

  1. If one encloses value data in double quotes, then embedded single quotes and apostrophes are perfectly valid and interpreted as intended;
  2. If one encloses value data in single quotes, then embedded double quotes are perfectly valid and interpreted as intended;

Given that, either approach immediately cuts the problem by half. I arbitrarily chose to enclose my 'column values' in single quotes. Consequently, embedded double quotes were valid and unremarkable.

Next, handle the apostrophes and single quotes in the data:

After a few experiments with Tomcat 5.1, I'd suggest that you forget trying to code literals in the fn:replace() function. Just use variables instead. Here's the JSTL solution that immediately worked for me:

	<c:set var="apos" value="'" />
	<c:set var="escApos" value="\\'" />

	<c:set var="escapedContent" value="'${fn:replace(parameter.value, apos, escApos)}'" />
      

(Note that the bounding single quotes enclose the data for MySQL.)
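For readers who want to check the substitution outside a JSP, here is a minimal JavaScript sketch of the same idea. The name `quoteForMySql` is hypothetical, and the extra step of doubling backslashes is my own assumption rather than part of the JSTL above. It covers only these two characters, so treat it as an illustration of the quoting logic, not a complete defence against malformed input. Where the query allows it, JSTL's `<sql:param>` with `?` placeholders avoids hand-escaping altogether.

```javascript
// Sketch only: quoteForMySql is a hypothetical helper mirroring the JSTL above.
function quoteForMySql(value) {
  var escaped = String(value)
    .replace(/\\/g, '\\\\')    // assumption: double any backslashes first
    .replace(/'/g, "\\'");     // then escape apostrophes and single quotes
  return "'" + escaped + "'";  // bounding single quotes enclose the data for MySQL
}

// Embedded double quotes pass through untouched; apostrophes gain a backslash.
var sample = quoteForMySql("the reader's \"favorite\" page");
```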

Rounded Corners with Various Techniques

Rounded corners on block elements lend elegance to layouts, and were consequently a subject of much discussion between creatives and developers prior to the availability of CSS 3. Many alternative solutions were developed, and with a little tweaking, the examples below all quickly worked on all A-Grade browsers.

CSS 3.0 border-radius and box-shadow

CSS 3.0 brought us the specifications for rounding corners using standard css styling. However, the implementation of border-radius and its derivative properties has been slow to come to IE. IE 9 appears to have resolved this; Safari and Chrome (WebKit) and Opera (Presto) deliver expected results, as does Firefox (although I haven't tested FF 7.0 to determine if "vendor specific" properties as shown are still required). Only Safari and Chrome, which are based on WebKit, and Firefox at or above version 4.0.1 will additionally render a box shadow on the second panel.

First panel:

  width: 200px;
  height: 150px;
  border-radius: 10px;
  -moz-border-radius: 10px;
  border: 1px solid #666666;
  background-color: #cccccc;

Second panel:

  width: 300px;
  height: 150px;
  border-radius: 25px;
  -moz-border-radius: 25px;
  border: 2px solid #eeeeee;
  box-shadow: 3px 3px 7px #777;
  -webkit-box-shadow: 3px 3px 7px #777;
  background-color: #cccccc;

Other Corner Rounding Techniques

Beyond the obvious solutions, notably including Flash, other approaches to corner rounding can be loosely organized into four categories:

  1. custom image positioned below a block element with z-index
  2. reusable corner images (gif or png) assembled with css classes and xhtml
  3. pure css-only classes with no images, painstakingly assembled with xhtml
  4. JavaScript/jQuery-driven corner rounding

Example - Category 4

This example uses the jQuery JavaScript "corners" plugin to round block elements.

JavaScript must be enabled in the visitor's browser for this corner rounding technique to work.

Each technique clearly brings its own strengths and weaknesses. But, in the final analysis, we're all eagerly embracing the css 3.0 implementation in the Trident (IE-9 only), WebKit, Presto, and Gecko rendering engines, albeit with vendor-specific properties in many cases. The examples above work in Firefox, IE, Safari, Opera, and Chrome (which make up over 99.9% of visitors to my sites).

The rounded corners on the animated gif at right are a simple illusion created by specifying a conventional (rectangular) animated gif as the background-image in css with the div's content consisting only of a rounded transparent gif.

Cross-Browser Rendering of Data Sources That Contain Embedded HTML

Discussion: Although I discourage the practice of storing data with embedded HTML for a variety of reasons, it is nevertheless necessary at times for a web developer to render html elements from existing XML, JSON, or database columns that contain embedded HTML.

Why would anyone store data with embedded HTML? Consider one example -- a database column whose content requires a wide range of Greek characters, but is stored in a database whose encoding cannot store the entire set of Unicode characters needed to reproduce them (for example, collation = Latin1 or ASCII). One database I've dealt with embedded them in the column data as HTML character entities instead, e.g., &epsilon; for the Greek character ε. Consequently, I found it necessary to render the column consistently across browsers by manipulating innerHTML.

This requirement can be readily achieved by shifting the element's text content to its innerHTML. As a starting point, the basic JavaScript construct is entirely straightforward:

var target = document.getElementById('testField');
target.innerHTML = target.textContent;

However, achieving a cross-browser solution is more complex. The code shown above will work as expected with Firefox — but not with other A-grade browsers since they don't recognize the .textContent property.

               IE 8      FF 3.5   Opera 10.5   Chrome 3.0   Safari 3.0
               Trident   Gecko    Presto       WebKit       WebKit
.innerText     Yes       No       Yes          Yes          Yes
.textContent   No        Yes      No           No           No

So for other browsers, it is simply necessary to substitute a different property. The equivalent code is:

var target = document.getElementById('testField');
target.innerHTML = target.innerText;

Solutions: There are two cross-browser solutions for this requirement. I've implemented both approaches on production websites for different reasons at different times, so "Choose your poison":

Solution 1:

It's not really necessary or efficient to perform a rigorous browser check for this issue alone. It's only necessary to determine which property the browser recognizes. This can be easily determined by testing if one or the other delivers an undefined when used:

var testInnerText = (document.getElementsByTagName("body")[0].innerText !== undefined);

Wrap the above in a <script ...> envelope and place it just before the ending </body> tag (since <body> must be defined before it can be tested). This will produce a variable (testInnerText) which can be interrogated in later scripts and functions to establish the browser capability, for example:

var target = document.getElementById('testField');
target.innerHTML = (testInnerText ? target.innerText : target.textContent);
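A variation on the same idea, sketched below, tests the target node itself instead of keeping a page-level flag. `readText` is a hypothetical helper name. It returns whichever of the two properties the node actually defines, and an empty string if it finds neither.

```javascript
// Sketch: readText is a hypothetical helper that feature-tests the node itself.
function readText(node) {
  if (typeof node.textContent === 'string') {
    return node.textContent;   // browsers that define .textContent
  }
  if (typeof node.innerText === 'string') {
    return node.innerText;     // browsers that define .innerText
  }
  return '';                   // neither property is available
}

// In a page this would be used as:
//   var target = document.getElementById('testField');
//   target.innerHTML = readText(target);
```

Testing the node directly means the helper does not depend on where in the page the script is placed.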

Solution 2:

There is another DOM property which is uniform across browsers, and can be read or written to manipulate the text contained in a node. This is the nodeValue property of the text node that holds the element's content (reached here through firstChild). In the following example it is read to move content which contains embedded HTML to that element's innerHTML, where it will be correctly rendered. Since it's universally recognized by all A-Grade browsers, it eliminates any requirement for a browser test.

So, the equivalent solution using the nodeValue property is simply:

var target = document.getElementById('testField');
target.innerHTML = target.firstChild.nodeValue;
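One caveat: `firstChild.nodeValue` assumes the element's first child is a text node, and it throws if the element is empty (firstChild is then null). A slightly more defensive sketch, using a hypothetical helper named `collectText`, concatenates the nodeValue of every direct text-node child (nodeType 3) and ignores anything else.

```javascript
// Sketch: collectText is a hypothetical helper, not a DOM method.
function collectText(node) {
  var text = '';
  for (var i = 0; i < node.childNodes.length; i++) {
    var child = node.childNodes[i];
    if (child.nodeType === 3) {   // 3 is the DOM constant for a text node
      text += child.nodeValue;
    }
  }
  return text;
}

// In a page this would be used as:
//   var target = document.getElementById('testField');
//   target.innerHTML = collectText(target);
```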

Conclusion: In view of the above, the complete table for both approaches is:

               IE 8      FF 3.5   Opera 10.5   Chrome 3.0   Safari 3.0
               Trident   Gecko    Presto       WebKit       WebKit
.innerHTML     Yes       Yes      Yes          Yes          Yes
.innerText     Yes       No       Yes          Yes          Yes
.textContent   No        Yes      No           No           No
.nodeValue     Yes       Yes      Yes          Yes          Yes

Cross-browser Page Centering

Centering the content of a web page or an entire website on a <body> background is simple for a given browser. But of course there are more than a dozen rendering engines (including notables Trident, WebKit, Presto, and Gecko), so it takes a dual-pronged approach to achieve a generalized cross-browser page-centering solution.

The example below assumes a <body> with a css-specified background color or image. The objective is to 'float' the site's pages in the center of the browser window regardless of the window or monitor size:

  1. IE (Trident) works as might be anticipated with style="text-align: center;" (but remember that it's the subsequent enclosed layers (div) that are centered)
  2. Firefox (Gecko), Safari and Chrome (WebKit), and Opera (Presto) deliver the same results, but only with style="margin-left: auto; margin-right: auto;"

So, a generalized cross-browser solution to center-float the inner content of the body regardless of window size simply uses both:

a) In your css:

  1. specify the background-color or background-image in the default <body> tag,
  2. create an id for an over-arching container with the css properties:

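        #parentContainer {
             margin-left: auto;
             margin-right: auto;
             text-align: center;
             }
(auto margins only center a block that has an explicit width, so give the inner layer a width and auto margins of its own:)
        #bodyContent {
             width: 780px;
             margin-left: auto;
             margin-right: auto;
             }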

b) in your template or markup:

  1. place a container ( <div id="parentContainer"> ) immediately below the <body>
  2. all subsequent layers should lie within this parent container, which gets closed just before the </body> tag

in other words, the markup on every page you need centered employs this envelope:


              <body>
                  <div id="parentContainer">
                      <div id="bodyContent">
                          the rest of your markup containered here
                      </div>
                  </div>
              </body>
            

note that this does not disturb the behavior of absolute and relative positioning of elements within the container.

Relative and Absolute Positioning

The css position property, especially when combined with the z-index property, provides all of the flexibility you might want to achieve a perfect pixel-positioned layout. However, the interactions between relative and absolute positioning are not immediately apparent and take a little experimentation to grok. Let's consider those interactions in this discussion.

"position: absolute" means just that -- if you casually specify absolute positioning for an element within the <body>, it will hang at the same absolute position, offset by its signed left and top values from the top-left origin of the browser window (i.e. the <body>), regardless of the window/browser/monitor size, and regardless of how you resize the window. If this is not your goal (and it usually isn't), then simply enclose it within a block element (e.g., a <div>) which lies in the normal flow and whose class/id contains "position: relative". In this event the child's 'absolute' position will be conveniently relative to its parent layer's top left origin.

Said another way, unless you're really trying to pin an element to a fixed location, a "position: absolute" element should lie within a "position: relative" block element.

Expanding on the example used earlier in the Page Centering topic, experiment with:

css:


        #parentContainer {
             position: relative;
             margin-left: auto;
             margin-right: auto;
             text-align: center;
             }
        #bodyContent { position: relative; width: 780px; margin-left: auto; margin-right: auto; text-align: left; }
        #page-heading { position: absolute; left: 50px; top: -5px; }

Given that, it then follows that the markup:


          <body>
              <div id="parentContainer">
                  <div id="bodyContent">
                      <div id="page-heading">
                          An absolutely-positioned Heading
                      </div>
                      <div>
                          <p>other elements follow normal flow</p>
                      </div>
                  </div>
              </div>
          </body>
          

will place the page heading 50 pixels right of, and 5 pixels above, the top left corner of #bodyContent, which is its nearest positioned ancestor. You can pixel-position any number of layers using this technique by adding additional classes or IDs to your css. The example above shows two animated gifs which are absolutely positioned within this jQuery pane and overlapped with z-index. Note that the images are separately specified in the preferred and alternate style sheets, so they'll also change with the color theme.

Depending on your needs for multiple <div> tags above (enclosing) an absolutely-positioned element, it may be necessary to make one or more of those parent layers "position: relative" to get the desired result.
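The rule at work here is that an absolutely-positioned element is placed against its nearest ancestor whose position is anything other than the default (static). The sketch below models that lookup with plain objects so it can be tried without a browser. `containingBlockFor` and the object shape are hypothetical, chosen only to mirror the ids in the example markup.

```javascript
// Sketch: walk up the parent chain to the nearest positioned ancestor.
// containingBlockFor and these plain objects are illustrative, not DOM APIs.
function containingBlockFor(layer) {
  for (var ancestor = layer.parent; ancestor; ancestor = ancestor.parent) {
    if (ancestor.position && ancestor.position !== 'static') {
      return ancestor.id;
    }
  }
  return 'initial containing block';  // no positioned ancestor was found
}

var body = { id: 'body', position: 'static', parent: null };
var parentContainer = { id: 'parentContainer', position: 'relative', parent: body };
var bodyContent = { id: 'bodyContent', position: 'relative', parent: parentContainer };
var pageHeading = { id: 'page-heading', position: 'absolute', parent: bodyContent };

var anchor = containingBlockFor(pageHeading);  // 'bodyContent'
```

Removing "position: relative" from an intermediate layer simply moves the reference point one positioned ancestor further out.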

But note that one important behavior of all rendering engines differs between position:relative and position:absolute. Should you specify position:relative in css and further specify top and/or left values, the space formerly occupied by the re-positioned node remains reserved in the normal flow and may consequently display as an open/blank space. To avoid this, use position:absolute within a position:relative parent block element instead, so that the space is "released" and no 'blank' space is rendered.

   

Variable Height <iframe>

Although I haven't used frames in years, I continue to come across occasional but perfect applications for iframes (inline frames). High among these is to invoke the built-in XML viewer of Firefox, IE, or Opera (but note that you must relax your XHTML header from 'strict' to 'transitional' to validate a page which contains an iframe).

That being said, one of the principal issues I (and just about every developer) have battled with iframes is the fixed "height" limitation. If you don't require the automatic invocation of an XML Viewer or XSL formatter, then with the advent of the XMLHttpRequest object, it became a simple matter to substitute another solution which doesn't suffer from the fixed-height limitation.

Solution: Consider instead using an XMLHttpRequest object response loaded into a named <div id="xxx">, which will automatically expand to the necessary height. This approach is used liberally elsewhere on this site, but in brief, simply retrieve a text or html file from the server and stuff the .responseText into the id.innerHTML (or id.childNodes[0].nodeValue DOM object) of the targeted div.
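A minimal sketch of that approach follows. `loadInto` is a hypothetical function name, and the optional `createRequest` argument is my own addition so the function can be exercised with a stand-in request object outside a browser. In a page you would omit it and let the function create a real XMLHttpRequest.

```javascript
// Sketch: fetch a text or html fragment and place it in a target element.
// loadInto and createRequest are hypothetical names used for illustration.
function loadInto(target, url, createRequest) {
  var request = createRequest ? createRequest() : new XMLHttpRequest();
  request.open('GET', url, true);             // asynchronous GET
  request.onreadystatechange = function () {
    // readyState 4 means the response is complete; 200 means OK.
    if (request.readyState === 4 && request.status === 200) {
      target.innerHTML = request.responseText;
    }
  };
  request.send(null);
  return request;
}

// In a page this would be used as:
//   loadInto(document.getElementById('xxx'), 'fragment.html');
```

Because the div lies in the normal flow, it grows to fit whatever markup arrives, which is the behavior a fixed-height iframe cannot provide.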

Other Tactics: A common discussion you will find in any web search on the subject of variable height iframes invariably illustrates the use of JavaScript to dynamically resize the iframe. I have personally tried and found each of these solutions to be lacking in one or more ways, and have invariably avoided using js to 'solve' the iframe height problem.

However, depending on the reason you're finding it necessary to vary the height of an iframe, many other approaches may work. Without embarking here on either of the Great Debates regarding CSS Hacks or Reset CSSs, I periodically find myself tasked to resolve cross-browser issues related to iframe height at low cost (read: minimum time).

Among the Causes: If you are using neither a Reset CSS nor your own declarations for margin, padding, line-height and other properties whose defaults vary between browsers (most notably between IE, Firefox, and Safari), and your rendering is consequently munged by unintended scroll bars, consider:

  1. obtaining and loading one of the recognized Reset CSSs; -or-
  2. adding the necessary properties to your 'base' css to level the playing field; -or-
  3. adding a CSS hack or two to 'fix' the problem by varying the iframe height between browsers.

Here's a live example of approach #3 from a site which did not use a Reset CSS and specified only some of the necessary properties. All that it required was a one line 'hack' to the css for IE6 - 8:

  1. First, delete the height property from the iframe tag
  2. Then, declare the following heights in css:
    iframe {
    	height: 2625px;    /* Firefox */
    	height: 2350px\9;  /* IE < 9 */
    }

Naming Conventions and Style Guides

As a freelance UI developer, one of the things I find myself continually doing is taking a "reset" on client lab conventions regarding naming. This applies equally to css classes and IDs, html/xhtml node names, JavaScript function names, and others.

For all practical purposes, nearly everyone employs lower camel case for JavaScript functions, but there are at least four approaches regarding the naming conventions used in css classes and IDs and xhtml nodes that I've observed. If you were to read the css sheets from the sites in my portfolio, you'd no doubt notice a range of conventions. They are:

  • lower Camel Case
  • upper Camel Case
  • hyphenated name
  • underscore separated

Let's take the example of a css class, generically "left column", which is a hypothetical float:left column on a page. The class or id might be variously named (respectively):

  • leftColumn
  • LeftColumn
  • left-column
  • left_column
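To make the four conventions concrete, the sketch below derives each form from a plain phrase. `nameVariants` is a hypothetical helper written for this illustration, not something taken from any client codebase.

```javascript
// Sketch: derive the four naming conventions from a space-separated phrase.
// nameVariants is a hypothetical helper used only for illustration.
function nameVariants(phrase) {
  var words = phrase.trim().toLowerCase().split(/\s+/);
  var capitalized = words.map(function (word) {
    return word.charAt(0).toUpperCase() + word.slice(1);
  });
  return {
    lowerCamel: words[0] + capitalized.slice(1).join(''),
    upperCamel: capitalized.join(''),
    hyphenated: words.join('-'),
    underscored: words.join('_')
  };
}

var variants = nameVariants('left column');
// variants.lowerCamel is 'leftColumn', variants.hyphenated is 'left-column'
```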

Observation: My first rule is: "Consistency beats subjective quality every time". If the shop convention is 'hyphenated', then it's far more productive to adhere to extant local convention and name the class "left-column" despite any personal preferences you might have. If you're a contractor, it's important to identify these (and other conventions) when commencing a new engagement; if you're a new full-time employee, then it is equally important to obtain, learn, and adhere to the corporate/site Style Guide -- or offer to produce this material if, as is often the case, a Style Guide doesn't exist.

Revising any extant convention to something arguably more appropriate is non-trivial, but nevertheless possible given the power of tools such as Dreamweaver. But with that being said, decisions regarding pervasive changes such as these should not be taken lightly: they affect essentially every page (and every developer) on a site and its branches, are costly, must first be carefully considered, and must then be funded as a distinct 'project' with many deliverables. Parenthetically, I'm a hand-coder, but would enthusiastically use tools such as Dreamweaver to accomplish a project of this nature at affordable cost.

Style Guides

This is not intended as a monograph on UI Style Guides, but here are nevertheless a few pivotal issues to keep in mind regarding style, and to understand thoroughly before embarking on page, template, or css development. A web development Style Guide must clearly indicate much more than simply naming conventions; beyond that it must also clearly specify peripheral detail including:

  • the corporate colors in hex notation
  • the corporate font-family, font-size, and related conventions
  • registered® and common-law™ Trademarks
  • required trademark marking
  • an intranet-based library of images and logos approved by corporate counsel
  • trademark image source location on the development intranet

Browser Trends and Display Resolution

In order to confirm which browsers to cross-check, I periodically collect statistics from some live web sites and watch for trends. Here are some interesting browser statistics from July 2009 and June 2010.

Browser Statistics
Date        Site   IE7     IE6     IE8     Firefox   Chrome   Safari   Opera
July 2009   W3C    15.9%   14.4%   9.1%    47.9%     6.5%     3.3%     2.1%
July 2009   tgz    37.4%   14.6%   22.8%   16.3%     0.5%     6.4%     0.2%

June 2010   W3C    8.1%    7.2%    15.7%   46.6%     15.9%    3.6%     2.1%
June 2010   tgz    17.3%   5.9%    44.1%   19.2%     4.6%     7.4%     0.5%

The W3C's extensive statistics are a great resource found here. Naturally, the visitors to W3C's website tend to be more technically savvy than the general population, so those stats reflect their browser choices. Conversely, visitors to theGeoZone.com ("tgz" in the above table) tend to be on the other end of the bell curve, so they reflect a microcosm of the general population. In general, we see a slow but steady drift from IE (all versions) to Firefox and, to a smaller extent, Chrome. Safari visitors were mostly using Macs.
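The drift away from IE is easier to see when the three IE columns are added together. The short sketch below does that arithmetic with the figures copied from the table. I am assuming that the rows not marked "tgz" are the W3C figures, as the paragraph above suggests, and `combinedIeShare` is a hypothetical helper name.

```javascript
// Sketch: sum the IE6, IE7 and IE8 columns for each row of the table above.
// Assumption: the rows not marked "tgz" are the W3C figures.
var stats = [
  { date: 'July 2009', site: 'W3C', ie7: 15.9, ie6: 14.4, ie8: 9.1 },
  { date: 'July 2009', site: 'tgz', ie7: 37.4, ie6: 14.6, ie8: 22.8 },
  { date: 'June 2010', site: 'W3C', ie7: 8.1, ie6: 7.2, ie8: 15.7 },
  { date: 'June 2010', site: 'tgz', ie7: 17.3, ie6: 5.9, ie8: 44.1 }
];

// combinedIeShare is a hypothetical helper; it rounds to one decimal place
// to hide floating-point noise from the addition.
function combinedIeShare(row) {
  return Math.round((row.ie6 + row.ie7 + row.ie8) * 10) / 10;
}

var totals = stats.map(combinedIeShare);  // [39.4, 74.8, 31, 67.3]
```

On these figures the combined IE share fell from 39.4% to 31.0% in the W3C rows and from 74.8% to 67.3% on tgz over roughly eleven months.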

Display Resolution

The current trend is that most computers are using a screen size of 1024 x 768 pixels or larger. The W3C's trend data can be seen here. Prior to 2008 we designed largely for 800 x 600 pixels; since that time we've moved to 1024 x 768 pixels as a baseline (and we regularly see 1280 x 1024 and larger in our web stats).

Apache Configuration Notes -- Blocking Rogue Spiders and setup.php Penetration Attempts

Discussion: I recently spent a full day tweaking up the behavior and performance of one of my Apache/Tomcat websites, and found myself reading through a ton of (useful) reference material from apache.org and some other webmasters faced with similar needs. Among other things, my mission included blocking the BaiDuSpider since it blatantly ignores robots.txt and indexes things that are there for the convenience of my web development team and not intended to be indexed.

Blocking BaiDuSpider: To save you a non-trivial amount of research time: the BaiDuSpider (a spider used by Chinese and Japanese search engines) visits from too many ip addresses (which you can nevertheless find in a web search) to make blocking their specific ip addresses a practical solution. I normally attempt to block rogue spiders by USER_AGENT name instead. But that being said, there are a few variations in capitalization to deal with, and I expect more will arise as time goes on.

Solution: Since I have no Asian content or language support on the target website, and would prefer it not be indexed in Japan or China, I simply blocked the ill-behaved spider in total using rewrite rules. Parenthetically, it's more conventional to put rules in the httpd.conf file, and more efficient not to use .htaccess at all for a variety of performance reasons, particularly those related to the dynamics of determining inheritance of .htaccess restrictions. But if it's impractical to edit the base httpd.conf file (as it was with this ISP), then .htaccess represents the simplest solution.


Blocking BaiDuSpider in .htaccess:  Place the following rewrite rule in the .htaccess file in your website's root directory:


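RewriteEngine On
# Match "baiduspider" anywhere in the User-Agent, case-insensitively ([NC]).
# No leading ^ here: some Baiduspider strings start with "Mozilla/5.0 (compatible; ...".
RewriteCond %{HTTP_USER_AGENT} baiduspider [NC]
# Answer every matching request with 403 Forbidden
RewriteRule .* - [F,L]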
	   

but first be very sure you have mod_rewrite enabled among the DSOs loaded in Apache's httpd.conf! You should find (or add) this line:


LoadModule rewrite_module modules/mod_rewrite.so
	  

Consequently, BaiDuSpider now receives an HTTP: 403 - Forbidden response whenever it requests any resource from the target website. Perhaps some day BaiDu will learn to observe the directions in robots.txt; or perhaps, arrogantly, not. In either event, I haven't seen BaiDuSpider fetch a page in several weeks now, only the hundreds of resulting HTTP:403 entries in my logs.
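
Before deploying a User-Agent rule, it can help to try the pattern against a few sample strings offline. The following is a minimal Python sketch rather than part of my Apache setup: it assumes Python's re module behaves like Apache's regex matching for patterns this simple, and the sample User-Agent strings and function names are illustrative assumptions. It compares an anchored pattern such as ^baiduspider with an unanchored one:

```python
import re

# Illustrative User-Agent strings (assumptions for this sketch, not log data).
SAMPLE_AGENTS = [
    "Baiduspider+(+http://www.baidu.com/search/spider.htm)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0",
]


def is_blocked(pattern, user_agent):
    """Mimic RewriteCond %{HTTP_USER_AGENT} <pattern> [NC]:
    a case-insensitive search anywhere in the string unless the pattern anchors it."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


def blocked_agents(pattern, agents=SAMPLE_AGENTS):
    """Return the sample agents that the pattern would block, in order."""
    return [ua for ua in agents if is_blocked(pattern, ua)]


if __name__ == "__main__":
    for pattern in (r"^baiduspider", r"baiduspider"):
        hits = blocked_agents(pattern)
        print(pattern, "blocks", len(hits), "of", len(SAMPLE_AGENTS), "sample agents")
```

In this sketch the anchored form only catches agents whose string starts with the spider's name, while the unanchored form also catches strings that embed the name after a "Mozilla/5.0 (compatible; ..." prefix.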


Blocking Purebot: Purebot is a surprisingly lame spider. Not the least of its odd behaviors is that it leaves bizarre 404s behind, making it easy to spot in your error log: Purebot will relentlessly request non-existent paths with dozens of erroneously repeated nodes in the path. Try adding the following rewrite rule to the .htaccess above:


# Block any User-Agent containing "purebot" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} purebot [NC]
RewriteRule .* - [F,L]
	  

This appears to be working on my sites, which now answer any request from Purebot with an HTTP: 403 - Forbidden result.


Blocking setup.php Exploits: While I'm on the subject of rewrite rules, another mission was to block the dramatically increasing number of penetration attempts to /phpMyAdmin.../setup.php. I've recently (June 2010 and beyond) seen several hundred HTTP:404 entries in my logs every month for many variations of the path to setup.php, which originate principally from Russia, Guatemala, British Columbia, and the Philippines. Adding these simple RewriteRules to the code above moved all of them from 404: Not Found to 403: Forbidden.


# Forbid any path containing "phpmyadmin" (case-insensitive)
RewriteRule phpmyadmin - [F,NC,L]
# Forbid any path ending in "setup.php"
RewriteRule setup\.php$ - [F,NC,L]
	   

Bear in mind that I cannot urge you strongly enough to thoroughly test any changes to .htaccess on your development or staging server before putting it into production! An .htaccess file presents a binary proposition — it is either precisely correct, or categorically wrong. And it will completely hose the behavior of Apache, Tomcat and other engines when it's wrong.
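
One low-risk way to begin that testing is to try the path patterns offline before they ever reach a server. The following is a minimal Python sketch under stated assumptions: the patterns are equivalent to the two RewriteRules above, Python's re module stands in for Apache's regex engine, the paths are written without a leading slash (as .htaccess rules see them), and every sample path and name is hypothetical.

```python
import re

# Patterns equivalent to the two RewriteRules; the NC flag is modelled as re.IGNORECASE.
FORBIDDEN_PATTERNS = [
    re.compile(r"phpmyadmin", re.IGNORECASE),
    re.compile(r"setup\.php$", re.IGNORECASE),
]


def is_forbidden(path):
    """Return True if either pattern matches the per-directory request path."""
    return any(pattern.search(path) for pattern in FORBIDDEN_PATTERNS)


if __name__ == "__main__":
    # Hypothetical request paths, not entries from real logs.
    for path in (
        "phpMyAdmin/scripts/setup.php",
        "PMA/scripts/setup.php",
        "phpmyadmin/index.php",
        "index.html",
        "docs/setup.php.txt",
    ):
        print(path, "->", "403" if is_forbidden(path) else "passes through")
```

A check like this only exercises the regular expressions; it does not replace trying the actual .htaccess file on a staging server.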

Efficiently Transferring Arrays Between Microsoft Excel Worksheets and Visual Basic

Discussion: I was recently investigating the encryption algorithm behind the classical Jefferson Wheel Cipher and decided to use Excel coupled with a VBA module as a tool to quickly prototype, implement, and explore its behavior. I 'knocked out' a spreadsheet and VB module which faithfully implemented the algorithm; however, I was startled at how slowly it ran.

With a little profiling, it quickly became apparent that transferring individual Cells() from a worksheet to a VB array is a slow process, and transferring individual Cells() from a VB array back to the worksheet is a *really slow* process. My initial approach had been simply to use the convenience of nested For/Next loops to retrieve the clear text and subsequently return the cipher text, transferring one character (Cell) at a time between the worksheet and a VB array. But with only a small array (26x16 cells, or 416 total cells), the roundtrip execution time was an astounding 20 seconds.

Solution: Abandoning the convenience of the .Cells() property (which accommodates simple calculation of the row and column as variables), I substituted a Range() assignment with a precalculated string parameter denoting the entire range. Transferring the 2D array as a single object executed about 20x faster, finishing in less than one second!

One difference in the supporting code is that regardless of the Option Base specified in the VB module, the Variant array returned from the worksheet range is always 1-based, as if "Option Base 1" were declared. In my case, I declared Option Base 1 in the module to eliminate the constant mental adjustment between 0-based VB arrays and the 1-based arrays transferred from/to the worksheet.

Starting point snippet (all-caps are convenience constants defining the location of the worksheet array):


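	'TOP_LEFT is the first worksheet row, LEFT_COL the first worksheet column
	Dim arrayVariant(1 To ROW_COUNT, 1 To COL_COUNT) As Variant
	Dim r As Long, c As Long

	'Write one cell at a time (slow): array indexes are offsets into the worksheet block
	For r = 1 To ROW_COUNT
		For c = 1 To COL_COUNT
			Worksheets(SHEET_NAME).Cells(TOP_LEFT + r - 1, LEFT_COL + c - 1).Value = arrayVariant(r, c)
		Next c
	Next r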
	   

This solution snippet executed 20x faster:


	Dim arrayObject As Variant
	Dim responseRange As String
	'Build an A1-style address such as "A1:Z16"; TOP_LEFT is the top row, LEFT_COL the left column.
	'The Chr() arithmetic only covers columns A through Z.
	responseRange = Chr(Asc("A") + LEFT_COL - 1) & CStr(TOP_LEFT) & ":" & _
		Chr(Asc("A") + LEFT_COL + COL_COUNT - 2) & CStr(TOP_LEFT + ROW_COUNT - 1)

	'then to load the array from the worksheet:
	arrayObject = Worksheets(SHEET_NAME).Range(responseRange).Value
	'or to move the array back to the worksheet:
	Worksheets(SHEET_NAME).Range(responseRange).Value = arrayObject
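
The address arithmetic behind the range string is easy to get wrong by one row or column, so here is a minimal Python sketch of the same calculation. It is not part of the VBA module, and the function names are hypothetical. It assumes 1-based row and column numbers, and it also handles columns past Z, which simple Chr() arithmetic does not.

```python
def column_letters(col):
    """Convert a 1-based column number to Excel letters (1 -> A, 26 -> Z, 27 -> AA)."""
    if col < 1:
        raise ValueError("column numbers are 1-based")
    letters = ""
    while col > 0:
        # Excel columns are a bijective base-26 system, hence the "- 1".
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def a1_range(top_row, left_col, row_count, col_count):
    """Return the A1-style address of a block, given its top-left cell and its size."""
    first = f"{column_letters(left_col)}{top_row}"
    last = f"{column_letters(left_col + col_count - 1)}{top_row + row_count - 1}"
    return f"{first}:{last}"


if __name__ == "__main__":
    # A 416-cell block, taken here as 16 rows by 26 columns purely for illustration.
    print(a1_range(top_row=1, left_col=1, row_count=16, col_count=26))
```

The last row and column are the start plus the count minus one; forgetting the "minus one" silently transfers an extra row or column.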