Kit Fenderson-Peters

[GEEKY] Slowness on beta.scribblit.com

December 18th, 2007 (08:35 pm)

current mood: tired

I have noticed on my work laptop (when I have it at home) that certain things on Scribblit's beta machine, which I administer, are poky. Specifically, editing one's userinfo and editing journal entries seem to be poky, but there may be other things I haven't seen. I'm not the only one experiencing it - several SI users have reported it as well. But when I try it from [info]ruisseau's machine, it's nice and quick. Same if I have my work laptop at work. Which leads me to believe it's an ISP issue, not a site issue. But if it were an ISP issue, why would it work differently on two machines on the same network (i.e. [info]ruisseau's and mine)? It could be that [info]ruisseau is running Linux, while I am running Windows. But if that were the case, why would Beta run quickly at work but slowly at home?

I must confess that I'm stymied. Geeky types, I have posted the headers from the quick and the slow sessions on Beta in my staff journal on SI. If any of you can offer some insight as to why Beta is running slowly, while keeping in mind everything I've said above, I'd be very grateful.

Comments

Posted by: toast ([info]toast0)
Posted at: January 5th, 2008 06:33 am (UTC)

Just looking at the headers you provided, in your slow case, the return was not compressed. I don't know how big the page is if it's gzipped, but it was 62k coming back uncompressed.

I would investigate why you're not getting a gzipped result in the slow case. I would also be interested in seeing if the request you send is the request that arrives at the beta machine (tcpdump is your friend!).
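To get a feel for how much compression matters here, this is a rough sketch that turns a response size into a packet count, with and without gzip. The 1448-byte payload per packet is an assumption, and the sample markup is made up and far more repetitive than a real page, so the compressed figure is optimistic.

```python
import gzip
import math

MSS = 1448  # assumed payload bytes per TCP packet; not measured on either site


def packets_needed(num_bytes, mss=MSS):
    """Smallest number of full-size packets that can carry num_bytes."""
    return math.ceil(num_bytes / mss)


def compare(body):
    """Return (raw_packets, gzip_packets) for a response body given as bytes."""
    return packets_needed(len(body)), packets_needed(len(gzip.compress(body)))


# Made-up, highly repetitive markup standing in for a real page of roughly 62k.
sample = b"<li class='entry'><a href='/users/example/'>example</a></li>\n" * 1040

raw_packets, gzip_packets = compare(sample)
print(len(sample), raw_packets, gzip_packets)
```

Running the same comparison on the actual page body saved from the browser would give a more honest number than the synthetic sample.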

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 8th, 2008 01:35 am (UTC)

Well, I've just gone and looked at the headers for SI main, and they are also uncompressed. The difference, however, between SI main and Beta is that SI main is not slow like Beta is.

Posted by: toast ([info]toast0)
Posted at: January 8th, 2008 02:03 am (UTC)

yes, but beta.scribblit.com is also hosted in a different location.

Based on traceroutes, I think www.scribblit.com is hosted in Dallas, and www.beta.scribblit.com is in Ontario. And based on your profile, you're in Kansas City. So beta.scribblit is about twice as far away as www.scribblit. A 62k response requires at least 44 packets (probably a few more), and TCP slow start means this will take several times the round trip time between your client and the server. I'm assuming non-persistent connections. I'm not going to do the math to figure it out, but let's say it takes 6 round trips to get 44 packets (I'm guessing based on slow start being exponential growth in the number of packets outstanding, and 2^6 > 44). If that's the case, beta.scribblit will appear to be 12 times slower than www.scribblit, simply because it's twice as far away. If compression limits this to about 5 packets (HTML compresses very well), it'll take about 3 round trips, and only seem to be about 6 times slower than the main site.
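The round-trip guess can be checked with a small sketch. It assumes an idealised slow start: the sender starts with one packet in flight, doubles the window every round trip, and never loses a packet. The initial window of 1 is an assumption, and the handshake and the request itself are not counted.

```python
def round_trips(packets, initial_window=1):
    """Round trips needed to deliver `packets` under idealised slow start.

    The window starts at `initial_window` packets and doubles after every
    round trip; loss, the handshake and the request are ignored.
    """
    sent = 0
    window = initial_window
    trips = 0
    while sent < packets:
        sent += window
        window *= 2
        trips += 1
    return trips


print(round_trips(44))  # 6
print(round_trips(5))   # 3
```

Under this model the transfer time is roughly the number of round trips multiplied by the round trip time, so both the page size and the distance to the server feed into how slow a fetch feels.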

There may be other differences between the two hosts other than location.

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 8th, 2008 02:18 am (UTC)

I'm sure there are other differences between the hosts. Regardless, a lack of compression should not cause the degree of slowness that I'm seeing - the pages never finish loading. Additionally, if the issue were lack of compression, or something else at the server level, it ought to affect both Linux and Windows clients equally. Instead, Windows users are experiencing the slowness, but Linux users (at least my wife) are not experiencing any appreciable slowness at all.

Posted by: toast ([info]toast0)
Posted at: January 8th, 2008 02:54 am (UTC)

Is it a fair comparison (from an empty cache / shift-reload)?

Looking into it a bit more, I can replicate a considerable slowness by viewing http://www.scribblit.com/users/staff_kit/profile vs http://www.beta.scribblit.com/users/staff_kit/profile

Especially if I do a shift reload. By watching firebug, it seems that static resources (javascript, css, images) are sometimes taking a rather long time to be sent. I'm honestly not sure if this is a client problem or a server problem.

(I'm currently on a Windows + Firefox machine; I can check again at home with Debian + Firefox and see if that performs differently.)

Also, I noticed beta.scribblit has its date set wrong (by about 4 hours... I don't know why this would be a problem, but hey?).

Can you tell if beta.scribblit is swapping, and what its Apache MaxClients is set to?

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 8th, 2008 04:03 am (UTC)

MaxClients of 20.
MaxRequestsPerChild of 500.

Is it normal for an httpd instance to be 121M resident?

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 8th, 2008 04:07 am (UTC)

Memory usage seems awful steep to me too - I've had monit kill apache b/c its total memory footprint got upwards of 1.5G.

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 8th, 2008 04:08 am (UTC)

Lord, what if it's monit?

Edit: No, I don't think it's monit. Monit hasn't been killing httpd lately AFAICT.

Edited at 2008-01-08 04:10 am (UTC)

Posted by: toast ([info]toast0)
Posted at: January 8th, 2008 05:49 am (UTC)

I don't think 121M is too big. mod_perl tends to keep things resident between requests, and Perl likes its memory. If you're concerned about memory usage, I would reduce MaxClients; less concurrency tends to do better than more. When you hit MaxClients, connections will begin to queue, but as long as traffic is bursty, this should be fine.*
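As back-of-the-envelope arithmetic, here is a sketch of the worst case for the settings quoted in this thread. It assumes every child grows to the 121M seen for one httpd instance and ignores memory shared between children, so it overestimates. Treating 1.5G (1536M) as the budget is also an assumption, borrowed from the point where monit was killing Apache.

```python
def worst_case_resident_mb(max_clients, per_child_mb):
    """Upper bound on resident memory if every child is as large as the one observed."""
    return max_clients * per_child_mb


def max_clients_for(budget_mb, per_child_mb):
    """Largest MaxClients whose worst case stays within budget_mb."""
    return budget_mb // per_child_mb


print(worst_case_resident_mb(20, 121))  # 2420
print(max_clients_for(1536, 121))       # 12
```

By this estimate, 20 children at 121M each could exceed 1.5G, which is one reason a lower MaxClients might be worth trying.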

I'm also not keen on MaxRequestsPerChild being 500, but I'm not at all familiar with running LiveJournal. I'm usually suspicious of MaxRequestsPerChild under 10000, but I've seen worse (I know of groups which have used MaxRequestsPerChild under 10, yikes!)

Hmmm... I am seeing significantly better performance on Firefox on Debian than I was on Windows. While watching Firebug, I don't see any requests taking significantly longer than others (I do have a nicer computer at home, though). I was intrigued enough to load up Firebug in Windows on VMware, where I saw the same performance as in Debian, but not quite intrigued enough to connect to my computer at work.

Did you make some changes? Otherwise, it could be higher usage during the day causing problems... or I'll be connecting to my computer tomorrow night.

* If you are always against the MaxClients, you're probably out of cpu, memory or i/o somewhere :D, or your MaxClients is set to less than your number of cpus.

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 8th, 2008 06:48 am (UTC)

No, no changes. This is the situation as it has always been. The site is slow on Windows clients, but quick on Linux clients.

Posted by: toast ([info]toast0)
Posted at: January 10th, 2008 08:17 am (UTC)

Ok, I did some more poking around. I'm very certain that the immediate cause of the feeling of slowness is that the larger resources take a long time to download.

I ran
time curl 'http://www.beta.scribblit.com/js/dom.js?v=1196030087' > /dev/null

to get an approximate download time from various hosts around the US. The times varied, but all were over a second (best times shown):

home: San Jose, CA, linux (direct to dsl): 1.26s
parents: Fountain Valley, CA, linux (direct to cable): 3.58s
work (at the office): Santa Clara, CA, freebsd (natted, but high bandwidth): 2.58s
work (in a remote facility): Reston, VA, freebsd (firewalled, but no nat, high bandwidth): 1.58s
school: Milwaukee, WI, linux (firewalled, partial t3, congested): 2.19s

Other than the school, all of these were able to consistently run
time curl 'http://www.scribblit.com/js/dom.js?v=1196030087' > /dev/null
at under 0.3s. (The school runs at 0.6 - 0.8s, but it's on a congested link, and I don't have anything else to run on in the northern part of the country.)
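For anyone who wants to repeat the comparison without curl, this is a minimal sketch of a best-of-N download timer. The function names and the default of three runs are choices made for this example, not anything the measurements above relied on, and it times the whole fetch including DNS and connection setup, much like `time curl` does.

```python
import time
import urllib.request


def fetch(url):
    """Download url and return the number of body bytes read."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return len(resp.read())


def best_time(url, runs=3, fetcher=fetch, clock=time.perf_counter):
    """Return the fastest of `runs` timed downloads of url, in seconds."""
    best = None
    for _ in range(runs):
        start = clock()
        fetcher(url)
        elapsed = clock() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


# Example use (needs network access, so it is left commented out):
# for host in ("www.scribblit.com", "www.beta.scribblit.com"):
#     url = "http://" + host + "/js/dom.js?v=1196030087"
#     print(host, round(best_time(url), 2))
```

Taking the best of several runs filters out one-off stalls, which matches the "best times" reported in the list above.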

I'm not sure if I can figure out anything more about why Linux at my house does so much better than FreeBSD at my office. I can tell you that tcpdump on both for the same curl produces more packets on FreeBSD than on Linux. (It looks like more acks, but it's hard to read.)

The other weird thing is that beta.scribblit sends some random-seeming packets when I download something from it. I don't know what that's about, but run
tcpdump host www.beta.scribblit.com and port not 80
and then, on another terminal, run
curl 'http://www.beta.scribblit.com/js/dom.js?v=1196030087'
That's certainly not normal.

Actually, running
tcpdump -nXs 0 host www.beta.scribblit.com | less
while fetching shows exactly what's happening.

Some of the packets are arriving malformed (who knows what happened there, likely some network hardware is FUBAR). Then a packet comes through OK, Linux issues a SACK, and eventually beta retransmits the first two packets that got corrupted.

My guess is that windows isn't doing a selective ack, so it has to wait for beta to retransmit the packet on its own.

Weird shit!

Posted by: toast ([info]toast0)
Posted at: January 11th, 2008 05:33 am (UTC)

Not sure if you saw this (cause the post was kind of long).

Short version:

Some network hardware beta is connected to is mangling some of the packets beta is sending (possibly some that it receives as well). To the TCP stream, this looks like packet loss. Linux has selective ack on by default, and I don't think Windows does, and I know my FreeBSD machines don't. With selective ack available, the receiver can signal to the sender that packets are missing when it receives packets after the missing/mangled packets; without selective ack, the receiver can only wait for the sender to resend the packets when it decides it hasn't received the ack. This difference is probably why things seem slower on Windows.
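A toy model may make the difference easier to see. It is a sketch only: segments are numbered 0, 1, 2, ... instead of using byte sequence numbers, the function name is made up for this example, and real stacks also have mechanisms such as fast retransmit on duplicate acks that are ignored here.

```python
def ack_info(received, sack=True):
    """Describe what a receiver can report back to the sender.

    `received` is the set of segment numbers that arrived intact.
    Returns (cumulative_ack, sack_blocks): the next segment expected in
    order, plus inclusive (start, end) ranges received beyond the gap.
    Without SACK the block list is always empty.
    """
    cumulative = 0
    while cumulative in received:
        cumulative += 1
    blocks = []
    if sack:
        for seg in sorted(s for s in received if s > cumulative):
            if blocks and seg == blocks[-1][1] + 1:
                blocks[-1] = (blocks[-1][0], seg)  # extend the current block
            else:
                blocks.append((seg, seg))          # start a new block
    return cumulative, blocks


# The first two segments were mangled; segments 2 through 5 arrived fine.
arrived = {2, 3, 4, 5}
print(ack_info(arrived))              # (0, [(2, 5)])
print(ack_info(arrived, sack=False))  # (0, [])
```

With the SACK block, the sender can work out that only segments 0 and 1 need resending. Without it, the sender only learns that the receiver is still waiting for segment 0.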

I can provide a pcap dump if you don't see the mangled packets.

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 11th, 2008 01:13 pm (UTC)

I did see your comment, and did a bit of looking myself. I must admit that this is beyond my expertise at the moment. :) Looking at the results of a tcpdump with Wireshark, I saw a lot of retransmissions and continuations. Is that what you're talking about?

Posted by: toast ([info]toast0)
Posted at: January 11th, 2008 05:45 pm (UTC)

Yeah, the retransmissions are a direct consequence of the packet mangling.

If you look at http://ruka.org/~toast/scriblit.out.gz (after decompressing) with Wireshark, you should also be able to see the packets that come from
206.53.55.231.17481 > 64.142.55.64.20039 and
206.53.55.231.18516 > 64.142.55.64.21584

which are the actual mangled packets. If you poke around enough, you can see that the content of the mangled packets is suspiciously similar to the first two packets (1:1449 and 1449:2897) when they get retransmitted correctly.

If you have access to another machine in the same subnet, I would try to run the same experiment between the two of them, as well as from the outside world to the second box, to narrow down the problem. It is likely to be either a network card on the box, the local network switch, or an upstream router. (The local switch is probably the default route, but the problem may actually be with its upstream network card.) I guess there's also the possibility of a bad operating system interaction, but I would think that's pretty unlikely.

Posted by: Kit Fenderson-Peters ([info]popefelix)
Posted at: January 8th, 2008 02:27 am (UTC)

I've posted the headers from my wife's Ubuntu box in the aforementioned entry in my staff journal. Maybe the difference will shed some light.
