"cannot connect to database" problem
Hi,

I'm at a total loss at the moment - I hope someone can give some advice:

As you probably know, we're moving servers. Last night I moved the forum.

The forum now uses a dedicated database server and a separate webserver. It works fine in general, but roughly 1 out of every 100 connections to the database fails to connect.

This is the problem. I don't have any more debug information. I just get the error message mailed to me saying
Quote :mysql_connect() [function.mysql-connect]: Can't connect to MySQL server on '213.xx.xx.xx' (1)

and that's it. I have no idea why. Does anyone happen to have experience with this? Or ways to debug why it cannot connect?
The two servers are connected via a 100 Mbit switch. I might want to use the second gigabit port of both servers in the future - would that help? The problem is that there will be two more webservers connecting to the database some time from now; those cannot be connected directly, as the database server doesn't have more ports available.
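For debugging, a tiny standalone probe that records the OS errno of each failed connect might give more to go on than the mailed warning. A minimal Python sketch (demonstrated here against a closed local port rather than the real database host, so the names and numbers are just placeholders):

```python
import errno
import socket

def try_connect(host, port, timeout=2.0):
    """Attempt one TCP connect; return None on success, or the OS errno on failure."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return None
    except OSError as e:
        return e.errno
    finally:
        s.close()

# Demo: grab a port nothing is listening on, then probe it (expect ECONNREFUSED)
probe = socket.socket()
probe.bind(("127.0.0.1", 0))   # let the OS pick a free port
free_port = probe.getsockname()[1]
probe.close()

err = try_connect("127.0.0.1", free_port)
if err is not None:
    print("connect failed:", errno.errorcode.get(err, err))
```

Pointing `try_connect` at the database host and port in a loop would show whether the failures are refusals, timeouts, or something else entirely.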

Post the line that has the mysql_connect() function on it. Also include any variables that are being set that are included with the function.
Sorry, but I do not see how that is relevant. The connect function takes a hostname/IP, username, password and database name. Those are always the same, so they either work or they don't.
The problem has to be outside of PHP, either in the network or in MySQL. But since MySQL has been set up to accept loads of simultaneous connections (and atm the maximum simultaneous connection count is just 7) I don't think it's MySQL's problem either.

So, can the network, with this low usage, already be overloaded and rejecting new connections? I highly doubt it, but that is what it _appears_ to be. I'm just guessing here though. Maybe there's some setting I'm overlooking (I hope so, but doubt it).

Or we may have faulty hardware, but I haven't been able to detect any hardware problems as such yet.
oooh, even though we don't use Windows, this might be interesting. Thanks, will check this out!
meh, misread it.

I'm guessing the SQL is on a different server to the webserver?
Try manually adding a DNS entry for it rather than using a hostname? Or try a direct IP address?

It could be a failing DNS server. Just a thought
Pff, should've guessed that

While googling I also found posts like
Quote :I've got a library of PHP code whose first line is a mysql_connect
statement, like this:

$dbh=mysql_connect() or die("mysql connect failed: $php_errmsg");

Approximately 1% of the time it just fails, for no stated reason:

Warning: mysql_connect() [http://www.php.net/function.mysql-connect]:
in /var/httpd/htdocs/pi/pi.php on line 3
mysql connect failed:

Any ideas why this would be happening? PHP is version 4.3.1 (same
results with the latest 4.3.3 release candidate), Mysql is 4.0.12

thanks
-jsd-

The best guess there was running out of connections, but nothing more concrete yet.
Quote from Krammeh :meh, misread it.

I'm guessing the SQL is on a different server to the webserver?
Try manually adding a DNS entry for it rather than using a hostname? Or try a direct IP address?

It could be a failing DNS server. Just a thought

Again, if that were the case it would either never or always work. AndroidXP's last post is right on target and is exactly the problem I'm having.

And I'm already connecting by IP - connecting via hostname would add an extra step, which would be a waste.
Quote from AndroidXP :Okay, I'm pretty out of the loop regarding MySQL and network connectivity (not that I ever was much in the loop to begin with), but maybe it's this?

http://dev.mysql.com/doc/refma ... nect-to-server-on-windows


OK, after some checking, this can't be it. The load is too low.
There are just too few connections and/or TIME_WAITs on the box. The maximum I've seen is 20 or so.
Compare that to our other servers, which easily have 500+ TIME_WAITs and work without any problem. It's gotta be something else
I had a similar random issue on an MS SharePoint 2003 server.

This was a network issue, related to hardware settings -> the speed settings on the switch port and/or the NIC.

Try to force or automate the speed both on the switch port and NIC.

Hope it'll help
Quote from Grudd :Try to force or automate the speed both on the switch port and NIC.

could you elaborate on that a bit?
Do you mean the things like xxxbaseTX and duplex modes?

Like atm the NIC of the db server is in mode :

media: Ethernet autoselect (100baseTX <full-duplex>)

Yes, I mean the speed and duplex mode

Actually you're in "autoselect" mode for the NIC, detecting a 100mb full duplex.
Do you know if it's the same on the switch port ?

My advice is that you can try to force those settings instead of "autoselect"
I guess nothing is showing up in the logs? It might be worth running mysql under --verbose to see what's going on. It might also be worth checking the max_connections to see if it's accidentally been mis-configured. I believe by default it's 100 connections in any state.
Quote from Grudd :Yes, I mean the speed and duplex mode

Actually you're in "autoselect" mode for the NIC, detecting a 100mb full duplex.
Do you know if it's the same on the switch port ?

My advice is that you can try to force those settings instead of "autoselect"

I could try, but not during the weekend - if something goes wrong we'll have no forum at all )
I can enquire on Monday. The servers are in London and I'm in Holland
Quote from the_angry_angel :I guess nothing is showing up in the logs? It might be worth running mysql under --verbose to see what's going on. It might also be worth checking the max_connections to see if it's accidentally been mis-configured. I believe by default it's 100 connections in any state.

I'mma take a break and then I'll check that out.
max_connections is set to 200 atm though.

my.cnf :

[mysqld]
innodb_data_home_dir =
innodb_data_file_path = ibdata2:10M:autoextend

innodb_additional_mem_pool_size = 16M
innodb_buffer_pool_size = 200M

max_allowed_packet=8M
max_connections = 200
key_buffer = 128M
myisam_sort_buffer_size = 64M
join_buffer_size = 4M
read_buffer_size = 4M
sort_buffer_size = 32M
table_cache = 3600
thread_cache_size = 512
wait_timeout = 900
connect_timeout = 10
tmp_table_size = 32M
max_connect_errors = 10
query_cache_limit = 2M
query_cache_size = 64M
query_cache_type = 1
query_prealloc_size = 131072
query_alloc_block_size = 32768
read_rnd_buffer_size = 2M

wait_timeout might be a bit high - that's about 15 minutes? AFAIK vBB doesn't use persistent connections?

There's also a max_user_connections directive - it isn't limited by default - or shouldn't be. It might be worth checking SHOW VARIABLES against the my.cnf to see if they match up.
Quote from the_angry_angel :wait_timeout might be a bit high - that's about 15 minutes? AFAIK vBB doesn't use persistent connections?

There's also a max_user_connections directive - it isn't limited by default - or shouldn't be. It might be worth checking SHOW VARIABLES against the my.cnf to see if they match up.

Some 'max' variables:

max_connect_errors 10
max_connections 200
max_delayed_threads 20
max_error_count 64
max_user_connections 0

VB can work with persistent db connections, but it shouldn't have to - not with the very low load and connection count atm. Note that the max. concurrent connection count is just _7_. This should be a piece of cake for our server (8 cores, RAID 10 15k RPM SAS drives etc)
Hmm.. I'd attack it with beer and mysql running under verbose mode

Incidentally, aren't SAS drives awesome? We've been putting them in for customers - absolutely fantastic bits of machinery
Quote from the_angry_angel :Hmm.. I'd attack it with beer and mysql running under verbose mode

Incidentally, aren't SAS drives awesome? We've been putting them in for customers - absolutely fantastic bits of machinery

I agree - already started with the beer

And yep, can't complain about the drives - I already had a single one in a personal server I got recently, and that was already pretty damn fast. So imagine two, RAIDed, heh.
Anyway, the idea was that this dedicated database server should run all our db stuff. I'm having doubts now though. The amount of data this machine can process greatly surpasses the interface it has to transfer it over, so hmmm, unless we swap the switch for a gigabit one, I'm not sure I'm gonna keep it as a dedicated db server...
Without knowing your load I couldn't possibly comment, but I'd say that with a box like that you could run lfsw, lfsforums and lfs.(net|com) from it without too much loss in performance. You'd be stuck for bandwidth out then though, I'd imagine

Definitely would've gone for gigabit connectivity between the servers though. We're running a 100-300 io/sec SQL instance on a customer site and it struggled without it. Granted, it is MSSQL, but that's not the point
Hmm, the logs show nothing.

I did notice one thing though.

When I restarted the server, there was a brief period when the server was shut down. During that time I received this error:

mysqli_real_connect() [function.mysqli-real-connect]: (HY000/2003): Can't connect to MySQL server on '213.40.20.1' (61)

This is slightly different from the other 'regular' error:

mysqli_real_connect() [function.mysqli-real-connect]: (HY000/2003): Can't connect to MySQL server on '213.40.20.1' (1)

See the difference?

The first has an error code of 61 and the second 1.

I tried to look up the difference, but I think I've gone blind - I cannot find a simple list of error codes and their explanations. But I think it's safe to assume that 1 means 'socket error' and 61 means 'there was nothing listening on the other side'. So for now I think Grudd was right with his assumption that there's something wrong with either a NIC or the switch.
I'm gonna see if I can do some connection-stress tests and see if I get similar results.
HY000 (general error) is used for unmapped errors - it's a server error message.
2003 is "Can't connect to MySQL server on ..." - a client error message.
$ perror 1 61
OS error code 1: Operation not permitted
OS error code 61: No data available

perror is a program that prints error messages; it's part of the mysql-server-5.0 package in my Ubuntu. The error codes are also in /usr/include/mysql/mysqld_ername.h and mysqld_error.h (libmysqlclient15-dev package).
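The same lookup can be done without the MySQL tooling - for example, Python's errno and os modules expose the platform's own error table, so like perror the output is OS dependent:

```python
import errno
import os

# Map numeric OS error codes to their symbolic names and messages
# (on Linux: 1 = EPERM "Operation not permitted", 61 = ENODATA "No data available")
for code in (1, 61):
    name = errno.errorcode.get(code, "?")
    print(code, name, "-", os.strerror(code))
```

On the BSDs the table differs - there errno 61 is ECONNREFUSED, which would fit the server-was-down case above - so it matters that the lookup runs on the affected server itself.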

What about writing your own simple program/PHP script, just for testing, to reproduce the error? It would allow you to play with the code...

EDIT: Note that the error messages for 1 and 61 could be different on your server - it's OS dependent. Run perror on the server itself to get the right ones.
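As a starting point, such a test program could be as simple as counting failed plain TCP connects at a fixed rate. A rough Python sketch (the host/port and the attempt counts are just example values, not anything from this setup):

```python
import socket
import time

def connection_test(host, port, attempts=1000, per_second=10, timeout=2.0):
    """Open `attempts` plain TCP connections at roughly `per_second` per
    second and return how many of them failed to connect."""
    errors = 0
    for _ in range(attempts):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        try:
            s.connect((host, port))
        except OSError:
            errors += 1
        finally:
            s.close()
        time.sleep(1.0 / per_second)
    return errors

# e.g. connection_test("213.40.20.1", 3306) to hammer the MySQL port gently
```

Running this between each pair of machines would show whether the failure rate follows a particular server or NIC rather than the database software.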
Interesting! Very nice.
Will check it out after I've had my dinner

Thanks!
Ugh, so I did an evening of testing and the results aren't promising.

A bit cryptic :

SERVER = MAKE + NIC + OS
SERVER A = brand unknown + some Intel PRO/1000 + FreeBSD 6.0
SERVER B = brand unknown + some Intel PRO/100 + old Red Hat
SERVER C = Dell + BCM5708 + FreeBSD 6.2-STABLE
SERVER D = Dell + BCM5708 + FreeBSD 6.2-STABLE
SERVER E = Dell + BCM5750 + FreeBSD 6.2-RELEASE (non-LFS)

The connection tests consisted of 1000 regular socket connections
@ 10 connections per second (i.e. nothing stressful):

D -> B : 0 Errors
D -> C : 1 Error
A -> C : 32 Errors
A -> D : 31 Errors
A -> E : 35 errors (!!!! alert, as server E is on a totally different network)

UPDATE :
A -> B : 6 Errors (now I'm baffled)

Servers A and B are our two old LFS servers, located in Burnley.
Servers C and D are our two new LFS servers, located in London.
Server E is my personal server, located in Amsterdam.

As you can see, the Dell servers are the ones with the problems. This was confirmed when I ran the connection test against my personal Dell server, which is on a totally different network.
After some googling, I conclude that the combination of Broadcom NICs + FreeBSD isn't 100% working yet

I think in the end I'll have to find new and better drivers for them on FreeBSD and recompile the kernels. I'm pretty sure now it's nothing to do with the switch, the network or user-land software.

This stinks pretty badly and I'm gonna have to sleep on it a bit. For production servers this is not really acceptable.

UPDATE
I figured for completeness' sake I'd do a test between the two old servers as well. To my big surprise I got 6 errors there! So now I'm not sure about my conclusion anymore. I really need some sleep first, I think...
I got this issue with a Dell PE2800... Broadcom too
