28 March 2010

Database Architecture with SQL Server 2008 - Part 3, File Groups, Files & Partitions

In our last post we discussed the different storage options we have and which is best in what situation. I also said in that article that most of the stuff being discussed there is out of the area of databases itself as storage is mainly handled by the operating system and database literally don't care if it is a physical drive or a network drive as long as it can read and write on it. but saying that, Database's main work is to store and retrieve the data, and in this article we will discuss how SQL Server does that.

SQL Server stores its data in its files, there are two types of these files.
  1. Data File -- Where the actual data is stored. (.mdf, .ndf file extension)
  2. Log File -- Where the transaction log is maintained. (.ldf file extension)
SQL Server organizes its data into a data structure called "Data Pages", each data page is of 8 kilo bytes. A Data file consists of series of such data pages, the actual data is spread across these data pages, if the data in a table row is more that 8K, its spread against multiple data pages, otherwise same data page can be used for multiple rows of data, For Large columns such as varchar(max) or BLOB the data is not stored in the same data page with the rest of data for that row but an address pointer is stored instead which addresses to the data page in which the actual data for that column-row is stored.

SQL Server gives you the flexibility to configure the storage location of these files, which is pretty obvious and its behavior which is not very clear. To understand this first let me explain you what a File Group is

File Group

A file group is a collection of data files that serves as a logical storage unit for database, so when you define the database objects that are physically stored such as Tables & Indexes you specify the File group you are using for the object, There can be multiple files groups associated with a database instance and each file group can have multiple files in it. So the big question is what do we achieve by doing so?

In my previous posts I have already explained that the storage is finite, small files perform better than large files, distributed information is faster to access than information consolidated in one single file. so making all this presumptions basis of our argument its fair to say that if a database has multiple data files, located on different physical drives it will perform better and will have a better capacity planning than a single data file database.

One thing to note here is the Log file does not belong the a file group and normally there is one log file per database. Log file as the name implies, stores the log of all the changes (inserts, updates, deletes) you made in your database, its used by the DBAs to track the changes and rollback if required. Log files are also utilized in certain backups.

How File Groups & Files works

Now as we have understood what File groups and files are, let us focus on how they work. When you define a File group there are two important properties you configure.

  1. Read-Only -- This specifies the data in objects built on this file group cannot change. Normally this is set later on and not at the time of database creation. It is used in the cases of Lookup tables or historical data table where you want to ensure that there is no modifications for such tables.
  2. Default -- There can be only one default File group. At the time of object creation if you do not specify any file group the default one will be utilized. The default file group also contains some meta data about the database which the other file groups does not have.




In the example above I have created two File groups, PRIMARY & SECONDARY. Now once the File groups are defined we will now create files under it.



As you can see there are four files here, 3 of them are data file with file type "Rows Data", while one is a log file. I have associated two files to the PRIMARY file group while the other to the SECONDARY file group. The four important configurations for files are
  1. File Group -- Which file group this file belongs to
  2. Initial Size -- When the database is created what should be the file size by default.
  3. Auto Growth -- How the file grows when the data in the file reaches the file size.
  4. Path -- Where the file will be stored.
In the example we specified the initial size to be 2 MB for the data files and 1 MB for log file. so in the image below you can see the files created with these sizes.


Now once this is defined we are ready to create our tables and other objects. As you can see below that when we are creating an object we can select which file group this object will be created on.

SQL Server handles how the data will be distributed across the files in a file groups and it follows an intelligent round robin method by taking into account the file size and auto growth parameters.

Partitions

Now everything is good here, but what if one of your table has or it could have massive amount of data and this results in performance issues or storage capacity problems. What we have learned that the objects are created on file groups now how can we divide the storage of an object across multiple file groups to resolve the performance issues. This is where Partitions are used. Partition divides a Table across multiple file groups using a systematic scheme.

Partitioning is achieved by three steps.

  1. Partition Function -- This is a function that defines how we will distribute the data into partitions.
  2. Partition Scheme -- Here we tell which file group will be used for each partition created by the partition funciton
  3. Table Partition -- Once we define the above two we create the table on the partition scheme and the rest is taken care by the SQL Server itself.
So now lets take this example, We have a product table which we want to partition, we are dividing the table into two partitions based on its primary key ProductId, We want all the records with primary key less than 50000 to go in PRIMARY File group while the rest should go in the SECONDARY file group. So here how we will do this.

First we will define our Partition function.

Now based on this function we will create the partition scheme

And than assign this partition scheme to our table.

and finally tell the SQL Server that use the ProductId column for partitioning through Partition Column List.

8 comments:

  1. nice read!

    Clicking on the images opens a new browser window so is a bit annoying to close the window everytime. Have a look at:

    http://www.c6software.com/Products/PopBox/

    Try adding the javascript so the images open in the same window without a new browser popping up.
    Tc

    ReplyDelete
  2. Thanks for the feedback, will take care of that in my later posts

    ReplyDelete
  3. Nice article, can you post an article on how Transactional Log file works (architecture).

    ReplyDelete
  4. Impressive articles.
    Keep up the good work...!!

    ReplyDelete
  5. brother is a file group collection of mdf,ndf's,ldf allocated to a single database ?

    ReplyDelete
  6. Its a very good artical on Database Architecture.

    ReplyDelete
  7. nice article...it helped alot..:)

    ReplyDelete