MySQL / PostgreSQL limitation

Amos Shapira amos.shapira at gmail.com
Tue Aug 23 04:35:13 IDT 2011


2011/8/23 Hetz Ben Hamo <hetzbh at gmail.com>

> Hi,
>
> I'm thinking about some project, and I was wondering about something: let's
> say I'm setting up a really big web site, something on the scale of Walla or
> Ynet.
> My question is about the DB: where exactly are the limits that force me
> to go master/slave, replication etc.?
> Is it related to memory? Processor? I know that many use replication etc.
> due to a shortage of RAM and/or CPU power.
>
> Let's say I have a machine with 512GB RAM, 20 cores, 20TB of disk. Such a
> machine should handle a huge load without any issue. With such a machine,
> theoretically, can a single MySQL / PostgreSQL instance handle such a load
> (a few hundred thousand connected users at once)?
>

I can't answer these specific questions directly. We have a chubby (2TB
right now, and growing) PostgreSQL database on a Hitachi FC SAN with 15k RPM
SAS disks in RAID 1+0, 84GB RAM and 8 CPU cores, configured to fail over to
an identical standby secondary using Red Hat Cluster Suite (running on top of
CentOS and Xen), and it can barely handle the load of some queries (it's not
a web site; it's mostly a data warehouse loading thousands of new items per
minute and letting customers query and configure things through a web
"portal"). Our developers are looking at solutions for clustering PostgreSQL
and at using SSDs for storage. I'm not sure how much a larger single
PostgreSQL instance would help. There are quite a few anecdotes and how-tos
about large PostgreSQL databases on the web (blogs, wiki.postgresql.org, etc.).
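
As a rough illustration of the fail-over side, here is a minimal sketch (my
own, untested, with made-up host names and credentials) of how a monitoring
script could ask each node which role it currently plays, assuming psycopg2
and PostgreSQL 9.0+ where pg_is_in_recovery() exists:

# Ask each node whether it is the active primary or a recovering standby.
# Host names and the "monitor" user are invented for the example.
import psycopg2

NODES = ["db-primary.example.com", "db-standby.example.com"]

def is_primary(host):
    """Return True if the node is serving as the primary (not in recovery)."""
    conn = psycopg2.connect(host=host, dbname="postgres", user="monitor")
    try:
        cur = conn.cursor()
        cur.execute("SELECT pg_is_in_recovery()")
        return not cur.fetchone()[0]
    finally:
        conn.close()

for node in NODES:
    print("%s: %s" % (node, "primary" if is_primary(node) else "standby"))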

I'm also helping a friend with an idea for a web site which might also grow
a lot, and I'm toying with the idea of using a NoSQL database for it. Right
now I'm looking at Cassandra (e.g. see
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/).
It's a completely new way of thinking about database development, but the
benefits from the operations and scalability perspective could be huge: the
high availability (replication means that if a node fails, another node
with identical data will serve the queries), the hardware cost (start with
small cheap boxes and add more as you go) and the ease of management (which I
expect to have once I get the hang of it) look very attractive, and that's
before even mentioning the options for Map-Reduce and distributed "data
mining". See also the SNA Project: http://sna-projects.com/ (actually a
cluster of related projects from LinkedIn).
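
To give a feel for how different the data model is from SQL, here is a tiny
sketch of a write and a read using the pycassa client (untested on my side;
the keyspace "Demo" and column family "Users" are made up, and it assumes a
single local node listening on the default Thrift port):

# Cassandra via pycassa: rows are keyed bags of columns, no joins and no
# schema in the SQL sense. Keyspace/column-family names here are hypothetical.
import pycassa

pool = pycassa.ConnectionPool('Demo', server_list=['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')

# Writes go to whichever replicas are up; reads can be served by any replica
# holding the row, which is where the high availability comes from.
users.insert('hetz', {'name': 'Hetz', 'site': 'really-big-web-site'})
print(users.get('hetz'))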

You might also want to look at combining technologies, e.g. keep a large
PostgreSQL database for manipulating the data but export the information into
smaller data stores which the web front ends access through a simpler
interface. For example, I've heard that Google's famous Map-Reduce search
intelligence algorithms run as batch jobs over the main data gathered by the
crawlers and produce simpler key-value stores, which are what your web
searches actually query.
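
In code, that batch export can be as simple as a periodic job that runs one
query and pushes denormalized results into a key-value store for the front
ends. A rough sketch (all table, host and key names are invented for the
example; assumes psycopg2 and the python-memcached client):

# Periodically denormalize rows from the big PostgreSQL database into simple
# keys that the web front ends can fetch without ever touching SQL.
import psycopg2
import memcache

pg = psycopg2.connect(host="warehouse.example.com", dbname="items", user="export")
kv = memcache.Client(["frontcache.example.com:11211"])

cur = pg.cursor()
cur.execute("SELECT item_id, title, price FROM items "
            "WHERE updated_at > now() - interval '5 minutes'")
for item_id, title, price in cur:
    # Front ends only ever call kv.get("item:<id>"), never run a SQL query.
    kv.set("item:%s" % item_id, {"title": title, "price": float(price)})

cur.close()
pg.close()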

This is just the tip of the iceberg from someone who's still just watching
and reading; I haven't even gotten around to dipping my toes in the water
(except for the HBase quick-start tutorial:
http://hbase.apache.org/book/quickstart.html, which succeeded on one desktop
and failed on another). There is plenty of material on the web about "large
web sites".

Cheers,

--Amos