Building Scalable Web Sites – Scalability

A scalable system has three simple characteristics:

  • The system can accommodate increased usage
  • The system can accommodate an increased dataset
  • The system is maintainable

Scalability is not about speed or complexity; those are separate topics.
For example, this program is slow but very scalable:
sleep(1);
echo "CIAO";

Scaling the Hardware Platform

  • Vertical Scaling
    • You have just one machine, and you upgrade it or replace it with a more powerful one (more RAM and CPU)
    • Problem: the cost per processor does not grow linearly; each step up in power costs disproportionately more
    • Problem: at some point, even the best machine in the world might not be enough for our application
  • Horizontal Scaling
    • You add more machines of the same type, each fairly cheap
    • The cost per processor is linear
    • Problem: more maintenance required
    • Problem: underutilization, e.g. if we scale out for more disk space, we will not use all the CPU power that comes with it

Redundancy

  • Cold spare – needs setup and configuration (software and/or physical) before it can start working
  • Warm spare – only needs to be switched on, because all the configuration is already done. A good example is a MySQL slave server that takes over as master when the master dies.
  • Hot spare – starts working automatically, because the system detects when a component fails. Problem: flapping
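
Flapping means that the system keeps switching back and forth between the primary and the spare because the failure detection oscillates. A common countermeasure is to require several consecutive checks before switching in either direction. The following is only a minimal sketch of that idea, assuming PHP 8; the class name, the node labels and the thresholds are hypothetical and do not come from the book.

```php
<?php
// Hypothetical monitor: class name, node labels and thresholds are illustrative assumptions.
final class FailoverMonitor
{
    private string $active = 'primary';
    private int $streak = 0; // consecutive checks that contradict the current choice

    public function __construct(
        private int $failThreshold = 3,    // failed checks before switching to the spare
        private int $recoverThreshold = 5  // healthy checks before switching back
    ) {}

    /** Record one health check of the primary and return the node that should serve traffic. */
    public function record(bool $primaryHealthy): string
    {
        if ($this->active === 'primary') {
            // A single healthy check resets the failure streak.
            $this->streak = $primaryHealthy ? 0 : $this->streak + 1;
            if ($this->streak >= $this->failThreshold) {
                $this->active = 'spare';
                $this->streak = 0;
            }
        } else {
            // A single failed check resets the recovery streak.
            $this->streak = $primaryHealthy ? $this->streak + 1 : 0;
            if ($this->streak >= $this->recoverThreshold) {
                $this->active = 'primary';
                $this->streak = 0;
            }
        }
        return $this->active;
    }
}
```

With these thresholds, a single failed or successful check never causes a switch, so a primary that is only intermittently unreachable does not make the traffic bounce between machines.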

Scaling PHP

PHP is stateless and therefore scales well: every request is served by a single process that does not need to talk to other processes. Requests can be distributed randomly across many servers, and the requests from one user can be spread across several servers.
That is true as long as you don’t use sessions that write data on a specific server; in that case the following requests need to be served by the same server. You can work around this by:

  • storing session data in a centralized database
  • using cookies (encrypted if necessary)
  • using msession on a centralized server

Otherwise, as noted, you can make sure the requests from the same user are served by the same server (sticky sessions, see below).
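
As an illustration of the cookie option, the sketch below keeps the session data in the cookie itself and protects it with an HMAC, so any server that knows the shared secret can validate it. Note that this signs the data but does not encrypt it: the user can still read the contents, and encryption would be an additional step. The function names and the cookie layout are assumptions made for this example.

```php
<?php
// Sketch only: function names and the cookie layout are assumptions for illustration.

/** Pack session data into a tamper-evident cookie value: base64(json) . "." . HMAC. */
function packSessionCookie(array $data, string $secret): string
{
    $payload = base64_encode(json_encode($data, JSON_THROW_ON_ERROR));
    return $payload . '.' . hash_hmac('sha256', $payload, $secret);
}

/** Return the session data, or null if the cookie is malformed or was modified. */
function unpackSessionCookie(string $cookie, string $secret): ?array
{
    $parts = explode('.', $cookie);
    if (count($parts) !== 2) {
        return null;
    }
    [$payload, $mac] = $parts;
    // hash_equals() compares in constant time to avoid leaking the expected MAC.
    if (!hash_equals(hash_hmac('sha256', $payload, $secret), $mac)) {
        return null;
    }
    $json = base64_decode($payload, true);
    if ($json === false) {
        return null;
    }
    $data = json_decode($json, true);
    return is_array($data) ? $data : null;
}
```

In a real application the packed value would be sent with `setcookie()` and read back from `$_COOKIE` on the next request, whichever server handles it.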

Load balancing

  • The cheap method for load balancing is to create more than one “A” record in the DNS zone, but:

    • adding or removing a server is slow because of DNS caching, so somebody could still hit a faulty server
    • we can’t balance according to the load
  • So we need a load balancing appliance, which can be either software or hardware.
    If our site uses sessions, the load balancer should support sticky sessions (or we can use any of the alternative methods mentioned above)
  • LB with HW
    • expensive
    • could be hard to set up (because of the old-fashioned interfaces)
    • good if you plan to leave it there for ages
  • LB with SW
    • cheaper
    • some solutions: Perlbal, Pound, LVS
  • LB Layer 4
    • very commonly used
    • at TCP layer: it can support sticky sessions
    • round robin algorithm
  • LB Layer 7
    • It can use HTTP headers to load balance, in particular the request URI
    • We can then balance on a hash of the URL, so that the same files are always served by the same server (or pool of servers), which makes better use of the cache
  • Load Balancing for MySQL
    A regular load balancer is a poor fit here, because it typically works with HTTP-based protocols and MySQL does not use one.
    Instead, we can do a random shuffle in the code that connects to the databases. If a server is more powerful than the others, we can add it multiple times to the list of available servers before shuffling.
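
The two application-level ideas above, hashing the request URI and shuffling a weighted list of database hosts, can be sketched roughly as follows. This is a simplified illustration, not the book's code: the function names and host names are hypothetical, and the hash uses a plain modulo, which remaps most URLs whenever the number of servers changes.

```php
<?php
// Sketch only: function names and host names are assumptions for illustration.

/** Layer 7 idea: map a request URI to one server so the same URL always hits the same cache. */
function pickServerForUri(string $uri, array $servers): string
{
    // The first 7 hex digits of the MD5 give a non-negative integer on any platform.
    $hash = hexdec(substr(md5($uri), 0, 7));
    return $servers[$hash % count($servers)];
}

/**
 * MySQL idea: return the hosts in random order, to be tried one by one.
 * $weights maps host => positive integer weight; a heavier host is added
 * to the pool more often, so it is more likely to come first.
 */
function shuffledDatabaseHosts(array $weights): array
{
    $pool = [];
    foreach ($weights as $host => $weight) {
        for ($i = 0; $i < $weight; $i++) {
            $pool[] = $host;
        }
    }
    shuffle($pool);
    // Drop repeats while keeping the first position of each host.
    return array_values(array_unique($pool));
}
```

The connection code would then loop over the result of `shuffledDatabaseHosts(['db1' => 1, 'db2' => 3])` and use the first host that accepts a connection, which also gives a simple form of failover.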

Scaling MySQL

There are mainly two storage engines:

  • MyISAM
    • supports the full-text index type
    • table-level locking
    • so not the best for concurrency
  • InnoDB
    • supports transactions and triggers
    • takes as much as three times the disk space compared to MyISAM
    • row-level locking
    • so better for concurrency

MySQL Replication

Good when there are many more reads than writes (the typical scenario in web applications).
The data is replicated between multiple machines.
See notes
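
One common way for an application to benefit from replication, assumed here rather than taken from the notes, is to send writes to the master and spread reads over the slaves. The sketch below only decides which host a statement should go to; the class name, the host names and the very simple read detection are hypothetical.

```php
<?php
// Sketch only: class name, method name and host names are assumptions for illustration.
final class ReplicatedHosts
{
    /** @param string[] $slaves */
    public function __construct(private string $master, private array $slaves) {}

    /** Pick the host for a SQL statement: reads go to a random slave, everything else to the master. */
    public function hostFor(string $sql): string
    {
        // Deliberately naive: only statements starting with SELECT or SHOW count as reads.
        $isRead = preg_match('/^\s*(SELECT|SHOW)\b/i', $sql) === 1;
        if ($isRead && $this->slaves !== []) {
            return $this->slaves[array_rand($this->slaves)];
        }
        return $this->master;
    }
}
```

Because slaves apply changes slightly after the master, a read issued right after a write may still see the old data; reads that must be current can simply be sent to the master.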

Database Partitioning

See notes

Scaling Storage

  • As the database grows, we need more space.
  • It’s easy to get big hard disks; what we really need is some sort of failover at the storage level, to avoid needing expensive clones. That means implementing RAID.
  • For scaling, we can use federation
  • For sharing storage among different machines, the easiest way on Linux is NFS. NFS is designed to let you access a remote filesystem as if it were local. This is very powerful and flexible, but it is overkill for the file warehousing usually associated with web applications (e.g. the images related to a product on an e-commerce site). We are talking about plain file storage, so we only need simple operations such as writing, reading and deleting files. We don’t need to append, change ownership, touch…
    NFS therefore offers more than we need, which means overhead. NFS also keeps a socket open between every client and server, adding network overhead even when no work is taking place.
    For simply putting and deleting files we can use FTP or SCP. For reading, we can use HTTP on our storage server. That way we can use standard load balancers as well. For this purpose, we can even use HTTP servers specialized in delivering images, or any lightweight server (lighttpd).
    We can also use HTTP for our writes, using the PUT and DELETE methods. So we can do all our storage over HTTP.
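
To make the "all storage over HTTP" idea concrete, here is a minimal sketch of the storage server side, reduced to the three operations a file warehouse needs. The function name, the flat key format and the choice of status codes are assumptions for illustration; for simplicity a PUT always answers 201, even when it overwrites an existing file.

```php
<?php
// Sketch only: the function name, key format and status-code choices are assumptions.

/**
 * Handle one storage request against the directory $root.
 * Returns [HTTP status code, response body].
 */
function handleStorageRequest(string $method, string $key, string $body, string $root): array
{
    // Accept flat keys such as "product-42.jpg" only, which rules out path traversal.
    if (preg_match('/^[A-Za-z0-9][A-Za-z0-9._-]*$/', $key) !== 1) {
        return [400, ''];
    }
    $path = $root . DIRECTORY_SEPARATOR . $key;

    switch ($method) {
        case 'GET':
            return is_file($path) ? [200, (string) file_get_contents($path)] : [404, ''];
        case 'PUT':
            return file_put_contents($path, $body) === false ? [500, ''] : [201, ''];
        case 'DELETE':
            return is_file($path) && unlink($path) ? [204, ''] : [404, ''];
        default:
            // No append, chown, touch...: anything else is rejected.
            return [405, ''];
    }
}
```

A front controller could call it with `$_SERVER['REQUEST_METHOD']`, a key taken from the request path, and the request body read from `php://input`, then send the returned status with `http_response_code()`.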

Scaling Logs

When you have a cluster of web servers, the problem is that you have many log files instead of just one. Then you should use one of these methods:

  • Google Analytics or a similar service
  • beacons: these log only regular web pages; you can dedicate a server to this purpose
  • software such as Spread
  • centralise the database for logs
  • use the load balancer (that is where all the traffic goes through)