LFS Forum - "cannot connect to database" problem

#26 - Victor

Sun 13 May 2007, 01:14

Quote from Grudd :I got my issue with a Dell PE2800... broadcom too

what mode is your nic in now?

#27 - Grudd

Sun 13 May 2007, 01:20

Forced into 100 Mbit full-duplex mode

#28 - Kada_CZ

Sun 13 May 2007, 01:23

You probably found this already; maybe you could try the suggested "fix" before compiling the kernel, just to make sure that the problem is in the bce driver.

#29 - Victor

Sun 13 May 2007, 21:50

Well, while testing on my personal Dell server, the trick of forcing the NIC to 100baseTX full-duplex actually worked.
So then I thought I'd do the same on the forum server -> resulted in massive packet loss.
I then thought I'd give the 10base option a try. It was then that the server became totally unresponsive, so I phoned support to have it rebooted. After the servers got rebooted, one never came back up. It turned out that the network socket had busted pins, and the connection coincidentally broke just when I changed the NIC mode to 10base.
Then I had the support guy swap the cable to the secondary NIC port, after I had updated the rc.conf file so that port would be used rather than the busted one. But after the reboot it never came back online again.
So then you'd think "well, just have the support guy log in via the console and edit the config file" - yeah, that would be nice, if the Dell server had a PS2 port to hook up a keyboard to. IT DIDN'T .. oh my god. It only has USB ports and the support ppl didn't have a USB keyboard anywhere.

So after all this trouble, I decided to move the forum back to the old servers while the problems on the new ones can be solved.

But in the end, the forcing to 100base did not work on the bce drivers

So the problem still has not been solved.
I'm thinking we could plug in a pci nic of a brand that we know will work. That would solve the problem and also the busted socket problem ..

bah

#30 - the_angry_angel

Sun 13 May 2007, 22:09

Quote from Victor :But in the end, the forcing to 100base did not work on the bce drivers

So the problem still has not been solved.
I'm thinking we could plug in a pci nic of a brand that we know will work. That would solve the problem and also the busted socket problem ..

Warranty time I think

Honestly, the fact that they don't have a USB keyboard handy is a bit worrying. We're starting to see PS/2 being phased out on all the rackmountable servers we supply, from IBM, Dell and others, and at the moment we don't even supply as much rackmounted kit as we have in previous years. So quite why a company that I'd expect to specialise in rackmounted kit hasn't got at least one to hand is very scary.

I guess that connecting via serial isn't an option for these guys either (or your box isn't setup for serial connections)?

On the plus side, at least you know the probable cause of the problem for the moment

It's just a shame you found out now

#31 - Victor

Sun 13 May 2007, 23:01

Quote from the_angry_angel :Honestly, the fact that they don't have a USB keyboard handy is a bit worrying to be honest

hmm and i thought Redbus was the biggest datacentre company in the uk - slightly odd indeed, but well, can't do much about that now unfortunately. We'll see tomorrow when Supanet office folks are back at work again.

See, weekends are not a good time to do this kind of experimenting

#32 - Grudd

Mon 14 May 2007, 00:09

Quote : But in the end, the forcing to 100base did not work on the bce drivers

It worked for me, it worked on your personal Dell server... strange it didn't work for the forum server

I'm going to PM you Victor...

#33 - Victor

Mon 14 May 2007, 00:16

the drivers on my personal server are different than the ones on the LFS servers. bge vs bce respectively. Maybe that makes a difference, i dunno.

and i have pm's disabled - you can mail me though

#34 - Dygear

Mon 14 May 2007, 23:08

mysql_pconnect

#35 - Victor

Tue 15 May 2007, 00:04

that'd be more like a patch rather than a cure, so i won't use that.

#36 - Dygear

Tue 15 May 2007, 04:16

If you should come up with a solution some time soon, do let me know about it, as I might run into something similar very soon.

#37 - Anarchi-H

Tue 15 May 2007, 11:15

We have a similar issue where I work.

Which MySQL client version are you using (i.e. the PHP side)? And which version of the Server? Have you checked for known bugs?
We are using 3.2x something. A recent visit from a MySQL consultant suggested this may be one cause as it is notoriously buggy.

As a temp fix we increased the resiliency of the DB layer by attempting multiple connects with increasing timeouts (up to a limit). If it fails every time, something is obviously broken. We also logged every failed attempt with a timestamp so that we can correlate that back to the DB logs. No conclusion has been reached on that yet because the DB admins are busy with a MySQL upgrade.
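
For anyone wanting to try the same kind of stopgap, here is a rough Python sketch of the idea described above: several connect attempts with growing timeouts, and every failure logged with a timestamp. It is not the code from that workplace. The names `connect_with_retry` and `tcp_connect` and the timeout values are made up for illustration, and a plain TCP connect stands in for the real DB client call.

```python
import logging
import socket

# The asctime field gives every failure a timestamp for later correlation.
logging.basicConfig(format="%(asctime)s %(message)s", level=logging.WARNING)
log = logging.getLogger("db_connect")


def connect_with_retry(connect, timeouts=(1.0, 2.0, 4.0)):
    """Call connect(timeout) once per timeout value until one attempt succeeds.

    `connect` is any callable that takes a timeout in seconds and either
    returns a connection object or raises OSError (hypothetical interface).
    """
    last_error = None
    for attempt, timeout in enumerate(timeouts, start=1):
        try:
            return connect(timeout)
        except OSError as exc:
            last_error = exc
            log.warning("connect attempt %d (timeout %.1fs) failed: %s",
                        attempt, timeout, exc)
    # Every attempt failed, so something is really broken: give up loudly.
    raise ConnectionError(f"all {len(timeouts)} attempts failed") from last_error


def tcp_connect(host, port):
    """Build a connect(timeout) callable for a plain TCP connection."""
    return lambda timeout: socket.create_connection((host, port), timeout=timeout)
```

Usage would be something like `connect_with_retry(tcp_connect("db.example.org", 3306))`, where the host name is a placeholder. As noted elsewhere in the thread, this hides a failing connection rather than curing it.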

#38 - Victor

Tue 15 May 2007, 12:13

Quote from Anarchi-H :We have a similar issue where I work.

Which MySQL client version are you using (i.e. the PHP side)? And which version of the Server? Have you checked for known bugs?
We are using 3.2x something. A recent visit from a MySQL consultant suggested this may be one cause as it is notoriously buggy.

As a temp fix we increased the resiliency of the DB layer by attempting multiple connects with increasing timeouts (up to a limit). If it fails every time, something is obviously broken. We also logged every failed attempt with a timestamp so that we can correlate that back to the DB logs. No conclusion has been reached on that yet because the DB admins are busy with a MySQL upgrade.

The tests I did were unrelated to mysql. I just created connection tests to _a_ port and this was showing the problems. Oddly enough, it also showed the same connection failures on this freebsd box the forum has been running on all along.

Same thing when doing the test towards my personal server, but when I did the forcing of the nic to 100baseTX full-duplex on that box, the problem went away and I got a 100% connection rate.

On the note of multiple attempts, I thought vBB already does this. Or at least, when I started to look for that in their code, I read a comment saying the connection is attempted 5 times - the connect function is even inside a do-while loop. But from what we could see irl, I don't think it actually attempted 5 times.

But I don't want to have to do multiple connection attempts.

It needs to be perfect. We've spent quite a bit of cash on additional servers and having a 1% connection failure is 1% too much.

we run mysql 4.1.22 server. I can't check client as i don't have access to the webserver atm. I believe it was the same client version. PHP reported a 5.xx version for the mysqli interface though. I'm not sure why or how.

I'm currently trying to find someone who can quickly get some new nics to the pop and either install them, or just deliver them to the pop.
oh how I dislike being so far away from them

#39 - Kada_CZ

Tue 15 May 2007, 13:57

Quote from Victor :I'm currently trying to find someone who can quickly get some new nics to the pop and either install them, or just deliver them to the pop.

Maybe also buy a USB keyboard and fasten it to the server.. I think that new NICs known to work flawlessly under FreeBSD are the best solution.

#40 - Victor

Tue 15 May 2007, 14:22

actually i wonder now if the nics will make a difference. I'm doing some more tests and getting weird results. Will elaborate on this later on when i'm more sure of the results.

Anarchi-H - what OS are you running that db on?

#41 - Anarchi-H

Tue 15 May 2007, 16:48

Quote from Victor :Anarchi-H - what OS are you running that db on?

Debian etch
Machine is a Dell Poweredge 2950 IIRC

#42 - Victor

Wed 16 May 2007, 01:28

ok my last tests were not good and incomplete.

I've got some new results that all point to one thing : freebsd has a problem. Or at least, the 4 installs of mine have. But due to the variety of h/w, I find it hard to believe this is actually a hardware issue, hence I dare to say it's freebsd.

Have a look :

Box  OS           Location   NIC
A    redhat       burnley    Intel
B    freebsd 6.2  london     Broadcom (bce)
C    freebsd 6.2  amsterdam  Broadcom (bge)
D    ubuntu       rotterdam  Intel
E    freebsd 5.4  rotterdam  Realtek
F    freebsd 6.0  burnley    Intel


SOCKET CONNECTIONS TEST ERRORS :

From -> To  Errors
E -> B      7
E -> A      30

C -> B      27
C -> A      4
C -> F      9

B -> A      2
B -> C      4
B -> F      5

A -> B      0
A -> C      0
A -> F      0

D -> B      0
D -> C      0
D -> F      0
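
To make the pattern in those numbers easier to see, here is a small Python sketch that adds up the errors by the OS family of the box that started the connection. The data is copied from the list and results above. The function name `errors_by_source_os` and the grouping into "freebsd" and "linux" are illustrative choices, not part of the original test.

```python
from collections import defaultdict

# Operating system of each box, copied from the list above.
BOX_OS = {
    "A": "redhat", "B": "freebsd 6.2", "C": "freebsd 6.2",
    "D": "ubuntu", "E": "freebsd 5.4", "F": "freebsd 6.0",
}

# (source, destination, errors), copied from the results above.
RESULTS = [
    ("E", "B", 7), ("E", "A", 30),
    ("C", "B", 27), ("C", "A", 4), ("C", "F", 9),
    ("B", "A", 2), ("B", "C", 4), ("B", "F", 5),
    ("A", "B", 0), ("A", "C", 0), ("A", "F", 0),
    ("D", "B", 0), ("D", "C", 0), ("D", "F", 0),
]


def errors_by_source_os(results, box_os):
    """Sum connection errors per OS family of the connecting (source) box."""
    totals = defaultdict(int)
    for src, _dst, errors in results:
        family = "freebsd" if box_os[src].startswith("freebsd") else "linux"
        totals[family] += errors
    return dict(totals)


print(errors_by_source_os(RESULTS, BOX_OS))
```

Every error in the table comes from a test started on a FreeBSD box (E, C or B), while the tests started on the redhat and ubuntu boxes report none, even towards the same destinations.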

Imo this is super weird. I tried googling, but there's no mention whatsoever of this issue. Am I really the only one with connection problems on all my freebsd boxes? I find that hard to believe!
But the good news is, there's nothing wrong with our new servers' hardware.

Another thing to note is that when there is a connection error, the packet was never actually sent out of the interface.
The socket was created fine, but the creation of the connection fails - a syn packet was never sent. At least, I do not see anything in pflog. There is a 'gap' in pflog when the error happens.
I also tried without pf, but this made no difference. Or at least, instead of an immediate error, there was now a big timeout before the connect() function returned false.
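
To tell those two kinds of failure apart (an immediate error versus a long wait before connect() gives up), each attempt can be timed and its error code recorded. Below is a rough Python sketch of that idea, not the test program itself. The name `timed_connect` and the 5 second default timeout are assumptions made for illustration.

```python
import errno
import socket
import time


def timed_connect(host, port, timeout=5.0):
    """Attempt one TCP connection; return (ok, error_name, seconds_taken)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    start = time.monotonic()
    try:
        sock.connect((host, port))
        outcome = (True, None)
    except socket.timeout:
        # No answer within `timeout` seconds: the slow kind of failure.
        outcome = (False, "timeout")
    except OSError as exc:
        # The kernel rejected or aborted the attempt: the fast kind.
        outcome = (False, errno.errorcode.get(exc.errno, str(exc)))
    finally:
        sock.close()
    return outcome + (time.monotonic() - start,)
```

A failure that returns within a few milliseconds with an errno name points at something local refusing the attempt, while one that takes the full timeout suggests the packet went nowhere or the reply never arrived.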

I am hoping this could be fixed in a way by tweaking some sysctl variables, but i haven't found anything useful yet.

What I'd really be interested in, is if anyone reading this with a freebsd install could run a connection test themselves. I've attached a simple c program that you can compile and run to see if the 1000 connection attempts it will create towards this forum are all OK or if some fail.
If anyone runs the test, please do note your exact os and version and I guess it'd be handy to know the nic and driver type / version too.

(ps, the program is really simple, I know)
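
For readers without a C compiler to hand, a test of this kind can be approximated in a few lines of Python. This is a sketch of the idea only (open a TCP connection many times in a row and count the failures), not a port of the attached program. The function name `connection_test` and the 5 second timeout are assumptions.

```python
import socket


def connection_test(host, port, attempts=1000, timeout=5.0):
    """Open and close `attempts` TCP connections; return the failed ones.

    Each failure is recorded as (attempt_number, exception).
    """
    failures = []
    for i in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
        except OSError as exc:
            failures.append((i, exc))
        finally:
            sock.close()
    return failures
```

Calling `len(connection_test(host, 80))` with the target's host name gives the error count for one run. Since the counts fluctuate between runs, it is worth repeating the test a few times before drawing conclusions.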


#43 - Kada_CZ

Wed 16 May 2007, 02:43

I did a test from a freebsd box to the ubuntu box, the machines are on different subnets (2 hops). The result was 0 (zero) errors.....

The freebsd box:
$uname -a
FreeBSD <snip> 6.1-PRERELEASE FreeBSD 6.1-PRERELEASE
root@<snip>:/usr/obj/usr/src/sys/<snip> i386

$pciconf -lv
...
vr0@pci0:18:0: class=0x020000 card=0x80a11043 chip=0x30651106 rev=0x74 hdr=0x00
vendor = 'VIA Technologies Inc'
device = 'VT6102 Rhine II PCI Fast Ethernet Controller'
...

I don't know how to discover the driver type, I have no experience with freebsd. The kldstat command shows:

1    8 0xc0400000 3ce45c   kernel
2    1 0xc07cf000 628f4    acpi.ko
3    1 0xc329b000 1d000    radeon.ko
4    1 0xc32b8000 f000     drm.ko

I don't think that the specs of the Ubuntu box are relevant here, but if you want them too, let me know.

Maybe, could you run the rest of the tests (i.e. on each machine test connection to all 5 remaining boxes)?

EDIT:
/etc/sysctl.conf in the freebsd box:

kern.ipc.somaxconn=1024
kern.ipc.maxsockbuf=1048576
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536

vfs.usermount=1

Isn't the problem some freebsd anti-dos attack feature? I just read this article.

EDIT2: I attached the result of the command: sysctl -a|grep ^net


#44 - Victor

Wed 16 May 2007, 13:14

thanks for your test and the sysctl values. I've been comparing them last night, but they're pretty much the same. And the ones that were different didn't make any impact at all when I tried changing them.

Also the dos idea led me nowhere.
The thing is, the failing outgoing connections appear to never even reach the network card or the firewall. There is no mention of them anywhere.

I'm starting to run out of options - even did tests, removing ipv6 from kernel, since you appear not to have that, but that didn't change anything.

#45 - Kada_CZ

Wed 16 May 2007, 14:05

Could you attach your "sysctl -a" output? (I'm not sure whether it contains any sensitive information; it includes part of the logs, at least.) I could ask a freebsd guru to look at it.

#46 - Victor

Wed 16 May 2007, 14:41

which of the 4 sysctl's do you want?

I'll attach the one from my test box here at home.

I have a feeling the problem is deeper into the os though and not solvable by a sysctl value. I have been up all night until 10 in the morning trying all kinds of different sysctl values, pf values, kernel recompiles with different settings, googled all night long to see if there were others with the same problem - nothing. But of course if someone can help, PLEASE YES

And it just boggles my mind why _all_ my 4 different freebsd boxes have the problem and yours doesn't. I hope there will be some others who can do the test still.


#47 - Kada_CZ

Wed 16 May 2007, 15:27

Quote from Victor :And it just boggles my mind why _all_ my 4 different freebsd boxes have the problem and yours doesn't. I hope there will be some others who can do the test still.

I tested it on 3 other freebsd boxes with no errors, but they all have the same hardware and configuration.... I'll do the test on some non-PC Unixes tonight, but I'm almost sure that there will be no errors.

#48 - Kada_CZ

Wed 16 May 2007, 16:29

Could you try to set (on your home box "E"):
net.inet.tcp.rexmit_min: 3
if you didn't try it already... And rerun test from E to A.

#49 - Victor

Wed 16 May 2007, 16:42

Quote from Kada_CZ :Could you try to set (on your home box "E"):
net.inet.tcp.rexmit_min: 3
if you didn't try it already... And rerun test from E to A.

that value only takes multiples of 10, so 10, 20, 30, etc.

I tried with 10 - no difference from E to A. It got 13 errors this time (the number of errors always fluctuates a bit - sometimes it's more, sometimes it's less, but they're always there)

#50 - Victor

Wed 16 May 2007, 17:52

HM I may have found something that points to the cause.

i did a test on 127.0.0.1 - that all worked lovely - no errors.

I then had a look at pfctl -s all to see the pf stats build up with these connections. Because with the other tests I could see the 'current entries' number rise steadily during the connection test. (current entries for the states buffer)

But when doing the localhost test, this number did not rise.

So, I opened pf.conf and looked for my localhost 'pass in' line. That said :

pass in inet proto { tcp, udp } from $localhost to any

I noticed it has no 'keep state' - that would explain why the 'current entries' number of pf did not rise.
So I tried adding 'keep state' to this localhost rule :

pass in inet proto { tcp, udp } from $localhost to any keep state

I ran the test again, while looking at pfctl -s all
As expected, the current entries number now rose steadily.
AND GUESS WHAT - there were now errors on the localhost connection test.

SO, is it safe to conclude that there is in fact a problem with the connection states buffer / mechanism / whatever? hmmm!

I should add though that the state buffer is far from full.
But I'll double / triple check anyway to be sure (i did a lot of testing last night regarding this though)