Hi there

We have been having a rather strange problem on one of our classic ASP
websites for a couple of weeks now. During office hours, when under
load, the site has 100% uptime. The problem is that after hours
(usually around 2 AM-6 AM) and over weekends, when there are hardly any
visitors, the site becomes unresponsive at times. It basically
just times out.

It runs on two Dell PowerEdge servers with Windows Server 2003
Enterprise Edition, a web server and a separate SQL server. The site is
not that busy, only about 16 000 unique visitors per day, so it is not
a server performance issue as we run decent hardware. The technologies
used are IIS 6.0 and SQL Server 2000 with load balanced Cisco switches
and firewalls between the DB and Web servers.

There are no errors in the System or Application logs, and the HTTP
performance log is also fine, with nothing out of the ordinary.

The symptoms are pretty strange: the site just starts timing out at
night. There are no services scheduled at the times the site goes
down, and it does not go down at the same time every night or over
weekends.

When I restart IIS, the site comes back online for a while then just
goes down again. However when I do not restart IIS, the site keeps on
going online and offline at random until permanently recovering by
itself around 6 AM before business hours and then staying online
throughout the day for the cycle to repeat itself the following
night.

Because it only goes down at night, it basically rules out the
majority of the causes that I have looked at:
- Performance issues on the servers (would have failed under load)
- Anti-Virus/Firewall etc issues interfering with the server (would
have caused problems during the day)
- Website application issues (would have caused problems during the
day and we use the same CMS for a few sites and we don't have issues
on any other site)
- Backups on the server (Backup schedule does not correlate with
downtimes)
- No errors reported in eventlogs or http performance logs
- Windows updates (no updates were installed at the time that the
problems started)
- Server changes (there were no known changes to the servers at the
time the problems started)

What I am still looking at:
- DoS attacks at night. However, there is no evidence of this, and we do
not maintain the firewalls, so log access is almost impossible. The
ISP does not want to play along, which makes me a bit suspicious.
- General network issues at night in the datacentre (this would
however not explain why an IIS reset temporarily fixes the problem)
- The site was working perfectly until a certain date, then suddenly
it started going down. Trying to find out what changed.
- What I find funny is that certain connections to the server seem to
go through. For instance, on one internet connection the site is
available but on others it times out. It also times out on one
connection, only to work 10 seconds later after a refresh on the same
connection. So it feels like something is interfering with sessions.

My main focus is, however, still on IIS 6.0. I feel that is where the
problem is, but I am pretty much stumped currently. Has anyone experienced
anything similar, or know what could be causing this?

Sorry for the long and rambling explanation, if you need any more
details please let me know.

Thanks!

Re: Website failing after hours by Andrew

Andrew
Wed Jul 16 05:11:27 CDT 2008

Could it be that the disk drives are spinning down when it's not very busy,
and taking a long time to spin up again? Maybe with a cascading time-delay
effect such that the web server wakes up (a few seconds) then tries to wake
up the SQL Server (a few more seconds) and hey presto! the web browser
thinks the server has taken too long to respond?

Andrew



Re: Website failing after hours by prieurdp

prieurdp
Wed Jul 16 06:28:03 CDT 2008

Hi Andrew

The servers are Dell PowerEdge boxes (2950 and R900) with RAID5 arrays
so I don't think the hard drives spin down when the server is not very
busy. I am sure this is more an IIS related issue.

Regards

Prieur


Re: Website failing after hours by daKernel

daKernel
Wed Jul 16 21:58:11 CDT 2008

How many other sites are on the IIS box? If you have multiple sites, is
each IIS site in its own Application Pool? Are there any errors in
the Event Viewer regarding Application Pools?

Larry
www.windowsadminscripts.com

Re: Website failing after hours by David

David
Thu Jul 17 00:02:38 CDT 2008

If you see a timeout and it originated from IIS, then you will see
evidence of the timeout in the HTTPERR log file or long time-taken values
in the IIS log files. Please report what those log files look like
during your outages. If nothing takes a long time and there is no
error in the HTTPERR log file, then the problem is originating within the
user application running on IIS and not in IIS itself.


//David
http://w3-4u.blogspot.com
http://blogs.msdn.com/David.Wang
//

Re: Website failing after hours by prieurdp

prieurdp
Thu Jul 17 04:25:14 CDT 2008

Hi Larry and David

There are 5 websites on the server and they all run in their own
application pools. There are no errors or anything of note in Event
Viewer or the httperr logs.

This is an excerpt from the httperr log during a time when the site
was "down":

2008-07-17 02:10:40 66.249.67.37 47909 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:11:40 196.41.30.38 36783 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:11:55 209.62.82.6 56257 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:12:05 196.36.164.97 29299 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:12:45 86.54.168.116 4515 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:12:50 66.249.67.37 34855 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:13:15 61.135.168.39 48215 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:13:50 196.41.30.38 37280 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:14:55 86.54.168.116 1732 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:16:00 72.51.41.47 43926 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:17:06 86.54.168.116 2926 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:17:11 66.81.86.251 35909 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:17:16 196.41.30.38 38163 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:18:21 210.21.120.22 58803 196.38.192.2 80 - - - - - Timer_ConnectionIdle -
2008-07-17 02:19:16 86.54.168.117 4177 196.38.192.2 80 - - - - - Timer_ConnectionIdle -

During that time there are corresponding entries in the IIS logs to
match the above. At other times when the site is "down" however, there
are large time gaps in the log files indicating that nothing reaches
the site.
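
The gaps mentioned above can be measured rather than eyeballed. Here is a
minimal Python sketch, assuming the usual 13-column HTTPERR layout on
Windows Server 2003 and whitespace-separated fields. The function names
and the 30-second threshold are illustrative choices only, and the sample
lines are the first three entries of the excerpt above.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed column order of the HTTPERR log (HTTP.sys error log).
HTTPERR_FIELDS = [
    "date", "time", "c-ip", "c-port", "s-ip", "s-port", "cs-version",
    "cs-method", "cs-uri", "sc-status", "s-siteid", "s-reason", "s-queuename",
]
REASON_INDEX = HTTPERR_FIELDS.index("s-reason")


def parse_httperr(lines):
    """Yield (timestamp, reason) per data line; skip blanks and '#' directives."""
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#") or len(parts) < len(HTTPERR_FIELDS):
            continue
        stamp = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
        yield stamp, parts[REASON_INDEX]


def find_gaps(stamps, threshold):
    """Return (start, end) pairs where consecutive entries are more than threshold apart."""
    ordered = sorted(stamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]


sample = [
    "2008-07-17 02:10:40 66.249.67.37 47909 196.38.192.2 80 - - - - - Timer_ConnectionIdle -",
    "2008-07-17 02:11:40 196.41.30.38 36783 196.38.192.2 80 - - - - - Timer_ConnectionIdle -",
    "2008-07-17 02:11:55 209.62.82.6 56257 196.38.192.2 80 - - - - - Timer_ConnectionIdle -",
]

entries = list(parse_httperr(sample))
reasons = Counter(reason for _, reason in entries)          # tally of s-reason values
gaps = find_gaps([stamp for stamp, _ in entries], timedelta(seconds=30))

print(reasons)
for start, end in gaps:
    print("gap:", start, "->", end, "=", end - start)
```

Running the same two functions over the whole httperr file and over the
IIS log for the site would show whether the quiet periods line up with
the times the site appeared to be down.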

What bugs me most is that the site just times out. There are no IIS
error messages or anything of the sort. If the sessions reached the
web application, then surely IIS would still have logged something,
albeit an error in httperr log.

On the server we have had a lot of trouble with the Broadcom NICs.
First it was the teaming software that was causing trouble, then later
driver issues. That seems to have been sorted but I will have a look
at that again as well as per http://forums.iis.net/t/1135205.aspx. We
don't maintain the server so troubleshooting hardware issues is
difficult and the ISP is not really co-operating.

I am going to recreate the website again with a new app pool tonight.
Maybe that helps. Last night I tested with removing the "Shutdown
worker processes after being idle...." but that did not work :)

Thanks

Prieur

Re: Website failing after hours by David

David
Thu Jul 17 13:57:55 CDT 2008

I am really not convinced there is an issue with IIS in your
situation. IIS just happens to be where your website is running and
hence you can observe a "problem" happening, but that hardly means the
problem has to do with IIS.

The culprits I am thinking of are:
- user configuration of the web server which conflicts with the
application's requirements
- host environment "issues"

Can you describe what you are doing to detect these timeouts on this
website during off-hours?

If you are making a web request "ping" to the server, can you show
evidence in the IIS log file for every single such request during the
off-hours when the timeout happens? I need proof that the request was even
handled by IIS and didn't get sent elsewhere by a networking
router/load-balancer, etc.
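
One way to make every "ping" traceable is to tag it with a unique
query-string value and record when it was sent, so each request can be
looked up in the IIS log afterwards. A rough Python sketch of that idea
follows. It assumes the cs-uri-query field is being logged, and the
parameter name "probe", the helper names and any URL passed in are
made up for illustration.

```python
import time
import urllib.error
import urllib.request
import uuid
from datetime import datetime, timezone


def build_probe_url(base_url, probe_id):
    """Append a unique marker that should show up in the IIS log's cs-uri-query."""
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}probe={probe_id}"


def fetch_status(url, timeout):
    """Default fetcher: return the HTTP status code of a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server did answer, just not with a 2xx/3xx


def probe_once(base_url, fetch=fetch_status, timeout=30):
    """Send one tagged request and return a record to compare with the IIS log."""
    probe_id = uuid.uuid4().hex
    url = build_probe_url(base_url, probe_id)
    sent_utc = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    started = time.monotonic()
    try:
        outcome = str(fetch(url, timeout))
    except Exception as exc:  # timeouts, refused connections, DNS failures
        outcome = f"FAILED {type(exc).__name__}"
    elapsed = round(time.monotonic() - started, 3)
    return {"probe_id": probe_id, "sent_utc": sent_utc,
            "elapsed_s": elapsed, "outcome": outcome}

# Hypothetical usage, once a minute during the off-hours:
#   while True:
#       print(probe_once("http://www.example.com/default.asp"))
#       time.sleep(60)
```

A probe that failed on the client side and whose probe_id never appears
in the IIS log did not reach IIS at all. The sent time is recorded in
UTC because W3C extended logs are normally written in UTC.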

If all the requests are reaching IIS but are timing out, is it showing
up as a large value in time-taken in the IIS log, or as a timeout in the
httperr log? For both cases, please provide the log entry and Win32 error
codes.
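
For the time-taken check, a short Python sketch along these lines can
pull the slow entries out of a W3C extended log. It assumes time-taken
(in milliseconds) and sc-win32-status are enabled in the site's logging
properties and relies on the "#Fields:" directive to find the columns.
The function name, the 10-second threshold and the file name in the
comment are illustrative only.

```python
def slow_requests(lines, threshold_ms=10000):
    """Yield log rows (as dicts) whose time-taken exceeds threshold_ms."""
    fields = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            # Column names can change mid-file, so re-read them every time.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        taken = row.get("time-taken", "")
        if taken.isdigit() and int(taken) > threshold_ms:
            yield row

# Hypothetical usage against one day's log file:
#   with open("ex080717.log") as handle:
#       for row in slow_requests(handle):
#           print(row.get("date"), row.get("time"), row.get("cs-uri-stem"),
#                 row.get("sc-status"), row.get("sc-win32-status"),
#                 row.get("time-taken"))
```

If this returns nothing for the outage window, the requests either
finished quickly or never reached IIS, which narrows the search to the
httperr log and the network path.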


//David
http://w3-4u.blogspot.com
http://blogs.msdn.com/David.Wang
//