Friday, October 30, 2009

Commenting Code

Have you ever written code that reads from a BufferedReader? Suppose you read someone else's code that said

//There is a BufferedReader r
StringBuffer sb = new StringBuffer();
int i;
while ((i = r.read()) != -1)
sb.append((char)i);

What do you think? Will you change it to the more conventional

int i;
char[] data = new char[1024];
while ((i = r.read(data, 0, data.length)) != -1)
sb.append(data, 0, i);

Edit: Embarrassingly, this snippet was wrong as originally posted (it appended the whole buffer instead of only the characters actually read); in any case the code is only illustrative - reading a character at a time vs. reading in chunks.

Which is more efficient? Which is more 'performant'?

And finally if you read the original code with an additional comment

StringBuffer sb = new StringBuffer();
int i;
// under JIT, testing seems to show this simple loop is as fast
// as any of the alternatives
while ((i = r.read()) != -1)
sb.append((char)i);
would you even bother?
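
If you did bother, a quick (and admittedly unscientific) micro-benchmark is easy enough to write. This is only a sketch, not the code from ImportSupport, and a single timed run says nothing about JIT warm-up, but it shows the two approaches side by side:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ReadBenchmark {

    // Read one character at a time, as in the original snippet.
    static String readCharByChar(BufferedReader r) throws IOException {
        StringBuffer sb = new StringBuffer();
        int i;
        while ((i = r.read()) != -1)
            sb.append((char) i);
        return sb.toString();
    }

    // Read in 1K chunks.
    static String readChunked(BufferedReader r) throws IOException {
        StringBuffer sb = new StringBuffer();
        char[] data = new char[1024];
        int i;
        while ((i = r.read(data, 0, data.length)) != -1)
            sb.append(data, 0, i);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Build some test data; a real comparison would read an actual file or URL.
        StringBuffer input = new StringBuffer();
        for (int i = 0; i < 1000000; i++)
            input.append('x');
        String text = input.toString();

        long start = System.nanoTime();
        readCharByChar(new BufferedReader(new StringReader(text)));
        System.out.println("char by char: " + (System.nanoTime() - start) / 1000000 + " ms");

        start = System.nanoTime();
        readChunked(new BufferedReader(new StringReader(text)));
        System.out.println("chunked:      " + (System.nanoTime() - start) / 1000000 + " ms");
    }
}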

Code snippet taken from ImportSupport.java - jakarta-taglibs-standard-1.1.2-src.

Thursday, October 29, 2009

Spidering a site with JMeter

Sometimes we need to check every link on a site and see that they all work, and this question has come up a couple of times on the JMeter forum: 'How do I use JMeter to spider my site?'
But before we go into the solutions, let's take a step back and look at the reasons behind wanting to spider the site (or skip ahead to the solution).

a. You want to find out whether any URLs respond with a 404. This isn't really a task for JMeter; there are various open source/free link checkers you might use, so there really isn't a need to run JMeter for this class of problems (see http://java-source.net/open-source/crawlers for spiders written in Java; there are others too, like Xenu or LinkChecker).

b. You want to generate some sort of background load and you hit upon this technique: a spider run with a specific number of threads will provide the load. While a valid scenario, this doesn't really simulate what the users are doing on the site, so it goes back to: what are you trying to simulate? It's much better to simulate actual journeys with representative loads. You might need to study your logs and your webserver monitoring tools to figure this out. It's tougher to do, but it's more useful.

c. You want to simulate the behavior of an actual spider (like Google) and see how your site responds and whether all the pages are reachable. See a.

Other problems
A test without assertions is pretty much useless, and a spidering test is by its nature difficult to assert (other than response code = 200, and perhaps that the page does not contain your standard error message).

JMeter does not really provide good out-of-the-box support for spidering. The documentation refers to an HTML Link Parser which can be used for spiders, which leads some users to try it out and complain that it doesn't work. It does (see this post), but not how you expect, and not as a spider (the reference manual needs to change).

Before we go on to implementing an actual spider in JMeter, let's look at some alternatives that we have (using JMeter and not a third-party tool).
a. Most sites have a fixed set of URLs and a possibly dynamic set, e.g. a product catalog where each product maps to a row in the database. It is easy enough to write a query that fetches these (using a JDBC Sampler) and generates a CSV file containing the URLs you want. The JDBC sampler is followed by a Thread Group (with however many threads the spider will run) which reads each URL from the CSV. This is especially useful when you consider that it is quite possible that some links are not accessible from any other link on the site (this is bad site design, but it exists; e.g. the FAQs on my current site are not browsable, they must be searched for, which means there is no URL from which an FAQ is linked and a spider would never find them directly).

b. Some sites generate a sitemap (it may even be the sitemap used for Google) for the reasons mentioned above. It is trivial to parse this to obtain all the URLs. A stylesheet can convert it into a CSV and the rest is the same as point a.
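
For what it's worth, the sitemap is simple enough to flatten without a stylesheet too. A rough Java sketch, assuming a standard sitemap.xml sitting in the working directory (the file names are just placeholders):

import java.io.FileWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Reads a standard sitemap.xml and writes one URL per line to a CSV
// that a CSV Data Set Config can feed to the Thread Group.
public class SitemapToCsv {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("sitemap.xml");                      // path to your sitemap
        NodeList locs = doc.getElementsByTagName("loc");    // each <loc> holds a URL
        FileWriter out = new FileWriter("urls.csv");
        for (int i = 0; i < locs.getLength(); i++) {
            out.write(locs.item(i).getTextContent().trim() + "\n");
        }
        out.close();
    }
}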

One last thing before we start discussing JMeter solutions. The first time I came to know anything about how spiders work was when I ran Nutch locally (and later refined that knowledge with MapReduce).
In a simplified form:
a. The first stage reads pending URLs from a buffer/queue and downloads the content. This is multithreaded, but not so much as to bring down the site.
b. The second stage parses the content for links and feeds them back into the same buffer/queue.
c. A third stage indexes the content for search. This is irrelevant for our tests.
A related concept is depth, i.e. the minimum number of clicks it takes to reach a link from the root/home/starting point of the website.
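
To make the first two stages and the idea of depth concrete, here is a minimal single-threaded sketch. It has nothing to do with Nutch or JMeter, the starting URL is hypothetical, and the link 'parsing' is a crude regex rather than a real HTML parser:

import java.net.URL;
import java.util.*;
import java.util.regex.*;

// Minimal breadth-first crawler: stage (a) downloads, stage (b) extracts links,
// and the queue carries the depth at which each URL was discovered.
public class TinySpider {
    public static void main(String[] args) throws Exception {
        String root = "http://example.com/";                 // hypothetical starting point
        int maxDepth = 5;
        Set<String> seen = new HashSet<String>();
        Queue<String[]> queue = new LinkedList<String[]>();  // entries are [url, depth]
        queue.add(new String[] { root, "0" });
        seen.add(root);
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");

        while (!queue.isEmpty()) {
            String[] entry = queue.remove();
            int depth = Integer.parseInt(entry[1]);
            if (depth > maxDepth) continue;                  // safety limit on depth
            Scanner sc = new Scanner(new URL(entry[0]).openStream(), "UTF-8").useDelimiter("\\A");
            String html = sc.hasNext() ? sc.next() : "";     // stage (a): fetch the page
            sc.close();
            Matcher m = href.matcher(html);                  // stage (b): extract links
            while (m.find()) {
                String link = m.group(1);
                // a real spider would also restrict links to the allowed host here
                if (seen.add(link)) {
                    queue.add(new String[] { link, String.valueOf(depth + 1) });
                }
            }
        }
        System.out.println("Visited " + seen.size() + " URLs");
    }
}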

Attempt 1.
Using the previous depth definition, most sites (because of menus and sitemaps) need at most 5-7 clicks to reach any page from the root page (kind of like Kevin Bacon's six degrees of separation). This implies that instead of a generic solution we could have a hardcoded solution which fixes the depth that we look at and uses the time-tested method of Copy-Paste.
Here's what this solution would look like:

The test plan is configured to run the Thread Groups serially:
1. Thread Group L0 fetches all the URLs listed in a file named L_0.csv. Each request has a BeanShell listener attached which parses the response to extract all anchors and writes them to a separate temp file. The code which does this is lifted from AnchorModifier and is accessed via a BeanShell script calling a Java class (JMeterSpiderUtil).
2. Thread Group L0 Consolidate (single thread) creates a unique set of all the URLs from the temporary files created in step 1, subtracts the URLs already fetched from L_0.csv, and writes the remainder to a file named L_1.csv. This code is also in the Java class, and a rough sketch of the consolidation logic follows this list.
3. Thread Group L1 (multi-thread) fetches all the URLs listed in the file L_1.csv which was created in step 2. Each request again has a BeanShell listener attached which parses the response to extract all anchors and writes them to a separate temp file.
4. Thread Group L1 Consolidate (single thread) creates a unique set of all the URLs from the temporary files created in step 3, subtracts the URLs already fetched from L_0.csv and L_1.csv, and writes the remainder to a file named L_2.csv.
... and so on for any number of levels/depths that you want.
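
The consolidation itself is just set arithmetic on files. A minimal standalone sketch is below; the directory and file names are only illustrative, the real logic lives in JMeterSpiderUtil:

import java.io.*;
import java.util.*;

// Builds the next level's CSV: the union of all URLs found at this level,
// minus every URL that was already fetched at earlier levels.
public class Consolidate {

    static Set<String> readLines(File f) throws IOException {
        Set<String> lines = new LinkedHashSet<String>();
        BufferedReader r = new BufferedReader(new FileReader(f));
        String line;
        while ((line = r.readLine()) != null) {
            line = line.trim();
            if (line.length() > 0) lines.add(line);
        }
        r.close();
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Set<String> found = new LinkedHashSet<String>();
        // anchors written by the listeners, one temp file per thread
        for (File temp : new File("temp").listFiles()) {
            found.addAll(readLines(temp));
        }
        // subtract everything already fetched (L_0.csv, L_1.csv, ...)
        for (File done : new File("urls").listFiles()) {
            found.removeAll(readLines(done));
        }
        PrintWriter out = new PrintWriter(new FileWriter("urls/L_1.csv"));
        for (String url : found) {
            out.println(url);
        }
        out.close();
    }
}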
If you are any sort of developer, you are probably groaning at the above. "Hasn't this guy heard of loops? What about maintaining these tests? Are we going to make every change in 5 places?"
We could use Module Controllers to reuse most of the test structure, but it's still inelegant.
One of the reasons I've described the above is that even if the solution looks inelegant, it is easy to understand and doesn't take long to implement, which means you can start testing your site pretty quickly. Note that your priority is the testing of the site, not the elegance of the testing script.

Attempt 2
Let's now see if we can increase the elegance of the script. One of the problems we run into is that the CSV Data Set Config can't use variable names for the filename. Another problem is that in the solution above we run the Thread Groups serially and use a single thread in a thread group to combine the results. If we want to use a single looped thread group, we have to ensure that only one thread does the combining, and that it waits for all the other threads to complete. You could probably simplify this solution by extending the CSV Data Set Config or the looping controllers; I don't consider these approaches because I have no Swing experience at all, so the only ways I extend JMeter are via BeanShell or Java.
After some experimentation, this is the solution I've come up with:


1. The Loop Controller controls the depth/level.
2. The Simple Controller has an If Controller that is only true for the thread with thread number 1. It defines the current level and copies the file L_${currentlevel}.csv to urls.csv.
3. The 'wait for everyone' step is configured with a Synchronizing Timer (set to the total number of threads in the thread group) so that all the threads wait until the first thread has finished step 2.
4. The While Controller iterates over all the URLs in the CSV. The CSV Data Set is configured to read the copied urls.csv file (since we cannot make the name a variable); what we do in the subsequent steps is recreate this same file with new data. Each request has a BeanShell listener attached which parses the response to extract all anchors and writes them to a separate temp file. The code which does this is lifted from AnchorModifier and is accessed via a BeanShell script calling a Java class (JMeterSpiderUtil).
5. We have a copy of step 3 here: all the threads wait till everyone else is done (for that level only).
6. The If Controller ensures that the consolidation is done only by the first thread: all the files written in step 4 are combined into a unique set, all the URLs already processed are subtracted, and a new file L_${nextlevel}.csv is written. Properties are set so that ${currentlevel} now becomes ${nextlevel}, so that step 2 will pick up this new file and copy it as urls.csv for the CSV Data Set Config to read.
7. The Reset Property BeanShell sampler is used to reset the CSV Data Set Config:
import org.apache.jmeter.services.FileServer;

FileServer server = FileServer.getFileServer(); // get the file server
server.closeFiles();                            // close everything
// re-register the CSV; sharing mode is All Threads to avoid copying the alias
// name generation in CSVDataSet.java
server.reserveFile("../spider/urls/urls.csv", null, "../spider/urls/urls.csv");
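
The copy in step 2 can be done with a similar BeanShell sampler. A rough sketch, assuming the directory layout described later in this post and a property named currentlevel (adjust the names to whatever your test plan actually uses):

import java.io.*;

// BeanShell sampler sketch: copy the current level's CSV over urls.csv so that
// the CSV Data Set Config always reads from a fixed file name.
String level = props.getProperty("currentlevel", "0"); // property maintained by the test plan
FileInputStream in = new FileInputStream("../spider/urls/L_" + level + ".csv");
FileOutputStream out = new FileOutputStream("../spider/urls/urls.csv");
byte[] buf = new byte[4096];
int n;
while ((n = in.read(buf)) != -1) {
    out.write(buf, 0, n);
}
in.close();
out.close();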


This was run with a root of http://jakarta.apache.org/jmeter/index.html.
Only URLs with 'jmeter' in them were spidered, and only on the jakarta.apache.org host.
Level 1 - 17 URLs
Level 2 - 29 URLs
Level 3 - 125 URLs
Level 4 - 833 URLs
Level 5 - 2 URLs
Level 6 - 0 URLs
Level 6 coming back empty means there are no more URLs that satisfy our criteria. You could change the loop to a While Controller and use this condition to decide whether the test should exit. However, some sites generate unique URLs (e.g. by appending a timestamp), which makes it possible that your test never exits, so you should normally keep a safety limit on the maximum depth.
And I did get some failures too, e.g.
http://jakarta.apache.org/jmeter/$next
http://jakarta.apache.org/jmeter/$prev
So I guess the test is successful, because it found some issues!

Is Attempt 2 more elegant? Probably, but it's also less configurable, took about 2-3 days to get working, and needed some study of the JMeter source code. Note that the previous solution could vary the number of threads available to each Thread Group, but this one can't. However, by using the Constant Throughput Timer you can achieve different throughputs for different levels.

JMeterSpiderUtil.java
The major part of this code is from AnchorModifier.
The important snippets are shown below:
if (isExcluded(fetchedUrl)) // excludes stuff like PDF/.jmx files which can't be parsed
...
(Document) HtmlParsingUtils
        .getDOM(responseText.substring(index)); // gets a DOM from the response
...
NodeList nodeList = html.getElementsByTagName("a"); // gets the links
...
HTTPSamplerBase newUrl = HtmlParsingUtils.createUrlFromAnchor(
        hrefStr, ConversionUtils.makeRelativeURL(result.getURL(), base)); // builds the URL
...
// check whether the host is the one we are interested in and whether the path is one
// we want to spider; session ids or timestamp parameters could also be stripped here
if (allowedHost.equalsIgnoreCase(newUrl.getDomain())) {
    String currUrl = newUrl.getUrl().toString();
    if (matchesPath(currUrl)) {
        //currUrl = stripSessionId(currUrl);
        //currUrl = stripTStamp(currUrl);
        fw.write(currUrl + "\n");
    }
}
...
Download Code
SpiderTest - Attempt 1.
SpiderTest2 - Attempt 2.
JMeterSpiderUtil - Java utility.

If you want to use the code:
a. Ensure that the total number of threads is specified correctly in both Synchronizing Timers (use a property).
b. Some directories are hardcoded. I used a directory named scripts under JMeter home and another directory called spider at the same level as scripts. spider has two sub-directories, temp and urls. L_0.csv, the starting point, is copied into urls.
c. If you want to rerun the test, ensure you delete all directories under temp and all previously generated CSV files in urls (except for L_0.csv).
d. You might have to change the Java code to further filter URLs or otherwise improve it. The JMeter path regular expression is hardcoded.
e. You have to change allowedHost, probably to an allowable list rather than a single value.
f. You probably have to honor robots.txt.
g. You might want to check the fetch-embedded-resources behaviour, or change which URLs are considered for fetching (currently only anchors; no forms or AJAX URLs based on a pattern).

Note that the code is extremely inefficient and was only written to check whether what I theorized in http://www.mail-archive.com/jmeter-user@jakarta.apache.org/msg27108.html was possible.
There is a lot of work needed to properly parameterise this test, but hopefully this can get you started.

Code available here

Friday, October 23, 2009

Interview questions revisited

I've been experimenting with Webmaster Tools and Analytics for this blog, and while running a Google search I came across
http://www.experts-exchange.com/Software/Server_Software/Application_Servers/Java/BEA_WebLogic/Q_24000475.html+weblogic+portal+interview+questions (hint: use the Google cache to see the answers)
And on the BEA forums I see
http://forums.oracle.com/forums/thread.jspa?threadID=919149&tstart=15
http://ananthkannan.blogspot.com/2009/08/weblogic-portal-interview-questions_29.html
http://venkataportal.blogspot.com/2009/09/comming-soon.html
Compared with my own
http://theworkaholic.blogspot.com/2007/02/weblogic-portal-interview-questions.html
http://theworkaholic.blogspot.com/2009/10/weblogic-portal-interview-questions-ii.html

There's a pretty big difference between the kind of questions I ask and the kind of questions people seem to think will be asked, or indeed do ask. A multiple choice question? Really? I guess that was picked up from the BEA certification exam (the less said about certification the better). Is there a point asking people something that's right there in the documentation, or something that any respectable search engine could answer?
Let's get some assumptions out of the way:
a. A bad resource is extremely detrimental to any software project. The contribution is negative, and a big negative at that. It is better to not have the resource than to have a bad one.
b. There isn't an easy way to eliminate a bad resource at the short-listing phase.
In most cases there are more people applying for the job than there are jobs. The resume is too abused to be an effective eliminator. If you look at a typical Java/EE resume, every specification under the EE umbrella is covered. Everyone has solid knowledge and expertise in all the specifications. On-project experience is sometimes faked.
Would a quick, easily corrected multiple-choice paper help? I believe that this is actually bad. The people who aren't that knowledgeable know it, spend their time memorizing documents/APIs etc. before an interview, and can probably game such a test. The people who I know are good in their fields usually don't have much time or patience for the minutiae, but are quite capable of looking it up on demand. Project experience would be a good indicator, but it is costly to verify beforehand. References are usually given by friends and aren't reliable; typically an interviewee isn't going to provide a reference to someone who will give him a negative review.

So we can't rely on the short-listing process to eliminate the bad apples. As an interviewer, you must go into an interview thinking that you might be gamed. This means that straightforward questions might be answered well by a bad candidate. It doesn't mean that you should ask brain-teaser questions, which only indicate that the interviewee is good at solving brain teasers (or has Googled the answers).

What then constitutes a good interview question?
Here are my criteria:
a. The interviewee must be able to describe what he has worked on / is working on effectively. He must be confident in the modules he has worked on, and he must be able to answer questions related to his module when you vary some of the parameters. This is a deal breaker: a person who doesn't know his own project probably won't be able to handle yours either.
b. Most of the technical questions I ask are conversational, and there probably isn't a single right answer. The question is just the opening gambit, e4 for chess players. If I feel I am getting a recitation from documents, I introduce a twist or change a parameter of the problem (e.g. an answer like "I would design this with Spring, utilizing the Dependency Injection/IoC pattern, and use Hibernate..." would be met with "sorry, the Spring/Hibernate license doesn't meet the project requirements, you can't use it").
c. Hands-on experience with the technologies I'm looking for is always a great plus, but it isn't a deal breaker for me. If you can handle JSP, you can handle JSF. If you can handle Struts, you can handle other controller frameworks. What I can't stand is when someone starts off claiming all the stuff he has worked on, how he was the heart and soul of the entire project, the life of the party, and later changes his tune to say, well, I didn't really work much on that particular part. That's a deal breaker: dishonesty means I can't trust any of the other wonderful things you said, bye bye.
d. Never ask coding questions without also providing the books, the documentation, a search engine and a compiler. Writing code snippets on a whiteboard is stupid. Pseudo-code questions are perfectly acceptable. Don't ask people to reinvent sorting algorithms when there are so many books (when will I ever buy that Donald Knuth book?) that they could use. If you want to check analytical skills, use real-life examples: there must have been numerous problems in your project, so describe the circumstances and ask the candidate to make suggestions.

In some ways I'm glad that I don't have to conduct interviews anymore. The last time I was proudly telling my mother how many people I had rejected, she asked why I was depriving people of work, and said that I don't know how much they might need the job. While I still stand by my assumption that no resource is better than a bad one, it's still disturbing to think that I might (probably) have made errors in judgement, and maybe, just maybe, I rejected a deserving candidate who really needed it. Like I said, I'm glad I don't make hire decisions anymore.

Throughout this post, I have referred to the interviewee as 'him'. That's probably due to the fact that more than 90% of the candidates I've interviewed are male, which is a sad state of affairs for software.

Wednesday, October 21, 2009

First Weblogic Portal Pro


I'd like to thank..... This shouldn't give me that much happiness, but it does.

Tuesday, October 20, 2009

Weblogic Portal interview questions - II

The following are the Portal interview questions that I've used, kept, or been asked (in no particular order, and no answers either :) ).
I do not include questions (e.g. what is a nested pageflow) that can be answered with Google.
Also see Weblogic Portal interview questions - I
  • What options do you have for Single Sign On for a Weblogic Portal application (and in general)? Give the advantages and disadvantages of each approach.
  • If you are using WSRP, and the user is logged in to the consumer, is he also logged in to the producer? If so, how? If not, how do you do this?
  • If you have a standard static HTML application, how would you optimise it for performance? For each of the techniques you mention, how would it be implemented in Weblogic Portal?
  • How do you ensure that a Weblogic Portal application is easily searchable by external search engines like Google?
  • What are the serious problems/drawbacks of JSR 168/JSR 286? Under what circumstances would you not use these for your portlet implementation? Under what circumstances would you use them?
  • Why is an asynchronous desktop a bad idea? In what situations does it become a good idea?
  • What circumstances can cause issues with portal propagation? Would you use propagation on your actual production environment? If not, why not?
  • How would you integrate Flex / any Flash-based widget into your portal application?

Monday, October 12, 2009

Detecting missing files with JMeter

I have lately found that JMeter is becoming my tool of choice for almost all the normal mundane programming tasks. Case in point:
On my current website we have a bunch of PDFs (150K) which are accessible only via search and which each have an entry in a table for that purpose. Each PDF is linked to a language and multiple countries, so the total number of rows in the database is much larger than the number of files. Now, years later, due to human error and other causes, some of these records exist in the database but there is no corresponding PDF file on the webserver, which lets the user see a link when he searches for the data but gives a 404 error when he actually clicks it. I had to generate a report listing all these files.

Constraints:
a. Administrators won't let you run a program on the web server.
b. You could ask them to copy files to a separate directory, but it takes about a week to get approval for anything related to production except a database copy (which is available immediately).

I initially thought of asking for a recursive file-name listing of all the files on the webserver to compare against the database (but writing that Java program would have taken half a day to iron out the bugs), so I settled on JMeter:

Run a query to get a list of files and save it to a CSV (Squirrel SQL client)
Thread Group (10 threads in parallel)
  CSV Data Set
  HTTP Request (HEAD), with the web link to the PDF read from the CSV

Run from the command line with the sample_variables property set to fields from the CSV.
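
For reference, the non-GUI run looks something like the line below; the test plan and variable names are just placeholders, and sample_variables is the JMeter property that controls which variables are written to the results file:

jmeter -n -t check_pdfs.jmx -l results.jtl -Jsample_variables=pdf_url,country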
Time to run the test = 1.5 hrs. JMeter's sample HTML output was good enough to be shown to the users so they could fix the issues.
What's great is that when the missing files are uploaded I can easily verify the data again.

Now I could have used JMeter's JDBC sampler to eliminate the Squirrel client.

Tuesday, October 06, 2009

Profiling BEA Weblogic Portal Apps

Profiling a portal application running on earlier versions of BEA Weblogic has always been somewhat painful (and still is) if you aren't willing to pay for a commercial profiler (and it might still be painful even then). With Weblogic 8.1 I had used Eclipse Colorer, but that doesn't seem to work with later versions of Eclipse and hasn't been developed in a while; it crashed on Weblogic 10 (JDK 1.5). I tried out a few from the Open Source Java Profilers page, but some crashed the JVM and some didn't do what I wanted.
The basic requirements were:
a. I needed to check execution times.
b. I didn't want to recompile my application or make changes to code.

I'd played around a bit with TPTP, so I gave it a try, and it worked reasonably well: I eliminated some code that didn't cache data correctly, so all in all it was a success. I haven't had time to look through all the settings in detail, and I'm sure some of the settings are redundant, but they worked for me. I've created these steps using the latest available versions of TPTP/Eclipse.
I ran the test on Windows Vista. Folks using a different O.S. are probably smart enough to not need these steps.

Steps
a. Install Eclipse IDE for Java EE Developers

b. Install the TPTP (4.6.1) plugin. There is a set of screens on how to do this - http://wiki.eclipse.org/Install_TPTP_with_Update_Manager. You could also download the all-in-one which has Eclipse + TPTP. I also referred to a couple of links on TPTP: Profiling J2SE 5.0 based applications and the TPTP installation guide.

c. Download the agent controller for TPTP. Unzip it to a folder. Call this folder $AGENT_CONTROLLER_HOME

d. Set a new environment variable
JAVA_PROFILER_HOME=$AGENT_CONTROLLER_HOME\plugins\org.eclipse.tptp.javaprofiler



e. Set up the PATH (I did this in Control Panel --> System --> Advanced --> Environment variables):
$AGENT_CONTROLLER_HOME\plugins\org.eclipse.tptp.javaprofiler;$AGENT_CONTROLLER_HOME\bin;
You should have Java in your path somewhere. I use the same JDK as BEA does (i.e. Java 1.5; I did try Java 1.6, but it didn't work for me).



I run on Windows Vista, so all command prompts are launched with Run As Administrator, including the BEA server.

f. In a command window, cd to $AGENT_CONTROLLER_HOME\bin and run setConfig. Specify the path to Java (1.5) and the other options; I chose the options in the screen below.



g. Start the agent controller by running acserver.exe (ensure no firewall is blocking it, or unblock it).


h. In a new command-line window run SampleClient. If all is well, you should see the response. Close the SampleClient command window but keep acserver running.



Setting up BEA
i. Go to the BEA portal domain and change the following settings in setDomainEnv.cmd (these already exist, just change the values):
set debugFlag=false
set testConsoleFlag=false
set iterativeDevFlag=false
...
set PRODUCTION_MODE=true

Towards the bottom of the file (4-5 lines from the bottom), add the command to enable the profiler

set JAVA_OPTIONS=%JAVA_OPTIONS%
set JAVA_OPTIONS=-agentlib:JPIBootLoader=JPIAgent:server=controlled,filters=$DOMAIN_HOME\filters.txt;CGProf:execdetails=true %JAVA_OPTIONS%

Here we specify that the process should wait (server=controlled) until we connect to it, specify some filters for packages that we have no interest in (and which would make the system slower), and specify that we want to capture execution details.

Create a file named filters.txt in the path you have specified
org.apache.* * EXCLUDE
com.bea.* * EXCLUDE
weblogic.* * EXCLUDE
netscape.* * EXCLUDE
antlr.* * EXCLUDE
com.octetstring.* * EXCLUDE
com.rsa.* * EXCLUDE
org.omg.* * EXCLUDE
javelin.* * EXCLUDE
kodo.* * EXCLUDE
org.opensaml.* * EXCLUDE
com.pointbase.* * EXCLUDE
serp.* * EXCLUDE
com.solarmetric.* * EXCLUDE
schemacom_bea_xml.* * EXCLUDE
com.asn1c.* * EXCLUDE
com.certicom.* * EXCLUDE

When I hadn't filtered out the kodo packages I got a ClassFormatError, so at a minimum these packages must be filtered.

j. Now run startWeblogic. The process should wait (we specified server=controlled, remember).



k. Now start Eclipse. Click Run --> Profile Configurations. Click Attach to Agent and hit the new icon. A new configuration is created.



l. Now click the Agents tab; if all is well you should be able to see an entry.



m. Double Click it and specify the filters (same as the ones specified in filters.txt)



n. Click Next, uncheck 'run automatically', and click Finish.



o. Click Apply and Profile. Switch to the profile perspective.



We haven't started profiling yet, but the Weblogic server will now continue starting up. You probably have to wait about 10 minutes.



p. Once Weblogic is in running mode, you can start the profiling by clicking the run icon in the left pane. You can also click the execution statistics (though this might be empty since we have filtered out most of the default BEA code that runs).



q. Now exercise your application by accessing it in the browser or by running a test, e.g. a JMeter test.
You should now be able to see execution details in Eclipse. For example:



which shows 100 calls being made to DBService. Double click it.



which shows the method calling it (TestService.getList(): one call here makes 100 calls to the DB, plus some BEA security checks). The TestService is called by the portlet controller, as shown.



And you can easily conclude that there is some sort of N+1 problem here: a single request leads to 100 DB calls. Inspect the code, fix the problem, rerun the profile, and verify that you only invoke the DB once.
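
For what it's worth, the offending pattern usually looks something like the sketch below. The class names echo the screenshots (TestService, DBService), but the code is made up purely to illustrate the N+1 shape and its fix:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the N+1 pattern the profiler exposed.
class Item {
    long id;
    Item(long id) { this.id = id; }
}

class DBService {
    // Stubs standing in for real database calls.
    List<Long> findAllItemIds() { return Arrays.asList(1L, 2L, 3L); }
    Item findItemById(long id)  { return new Item(id); }
    List<Item> findAllItems()   { return Arrays.asList(new Item(1), new Item(2), new Item(3)); }
}

public class TestService {
    private DBService db = new DBService();

    // N+1 version: one query for the ids, then one query per id.
    public List<Item> getListSlow() {
        List<Item> items = new ArrayList<Item>();
        for (Long id : db.findAllItemIds()) {   // 1 call
            items.add(db.findItemById(id));     // N more calls, one per row
        }
        return items;
    }

    // Fixed version: a single query fetches the whole list.
    public List<Item> getListFast() {
        return db.findAllItems();               // 1 call
    }
}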

However, there is a caveat here: it is far, far easier to profile your code out of container. If you can separate out your code so that most of it runs outside Weblogic, then it's easier to profile. And as we all know, this isn't always possible.