Saturday, March 15, 2008

Preliminary Database Design Steps

Physical database design is concerned with the way data is physically stored.  A physical view of a record structure shows how the logical data model is implemented.

The starting point for physical database design is a logical data model in third normal form.  The physical database design is derived from the logical model.  A conscious decision may be made to denormalize some files for performance reasons.

Steps in Preliminary Database Design

The primary steps to convert the logical data model to the preliminary physical data model are as follows:

  • Convert entities to files (tables in relational database design).
  • Convert relationships (access paths) to keys (foreign keys in relational database design).

When examining access paths, make any changes required to the database to improve performance and ease of use.  These could include:

  • defining additional keys to accommodate access paths,
  • denormalizing files to reduce access time,
  • changing file relationships to shorten chain lengths,
  • tuning the database to favour more critical I/Os.

 

  • Convert attributes to data elements (columns in relational database design).
  • If necessary, denormalize data, add candidate keys, or define hierarchical relationships instead of carrying the owner record's primary key.

Note:  Database design is an iterative process, making it is difficult to draw an arbitrary line between preliminary and detailed design activity.  However, the objective of preliminary database design is to produce a schema against which developers can begin to write code, without focusing on the internal, physical tuning of the schema.  Tuning can be done later, with little or no impact on the code.

Importance of a Stable Database Design

Changes to the database design can have a wide impact on software structure and logic design so the database design should be as complete and stable as possible before detailed application design begins.

 

Physical Database Design Steps & Performance Considerations

The main steps in physical database design are as follows:

  • Examine performance considerations (see below).
  • Determine physical file characteristics considering the elements of physical design.
  • Complete the physical file design.  Assign physical files to devices, determine partitions, and establish file blocking.
  • Walk-through the database design with team members and, in some cases, customer representatives, to use the knowledge of other team members to uncover design flaws.
  • If performance is a major issue, perform a database design review to detect possible performance problems.
  • Continue to refine the database throughout design and development.  Revisions, however, must be tightly controlled and circulated immediately to all affected parties to avoid endless cycles of change.

 

DBMS PERFORMANCE CONSIDERATIONS

The following performance considerations must be examined regardless of the DBMS selected:

  • required performance level, taking into consideration:
  • the number of transactions per second,
  • volume of records,
  • size of records,
  • amount of missing data (i.e., if the application requires large amounts of non-valued data fields to be managed),  
  • amount of disk space, I/O, and CPU overhead required to support the DBMS,
  • capability to optimize high-use transactions,
  • capability to compile queries (or must queries be interpreted and optimized at every invocation?),
  • query language support that manages and limits the resources consumed by on-line queries,
  • report writers support that can minimize the number of passes over the data required to produce multiple reports,
  • implementation of optional data compression techniques,
  • impact on performance due to the DBMS implementation of checkpoint/recovery and deadlock detection techniques,
  • impact on performance due to the DBMS implementation of optional security, data dictionary features, and alternative views of the data.

 

Elements of Physical Database File Design

Physical file design includes those aspects of database design that are usually not visible to the users and applications.  The objective of physical file design is to optimize performance.

Physical data independence should not be given up unless significant improvements in machine performance can be recognized.  Software dependence on physical data structures adds complexity to software design and makes maintenance more difficult.  Database systems allow software design decisions to be made independent of the physical data organization and storage.

Physical file design considers the physical layout of data, addressing techniques, compaction techniques, and storage devices.

Physical Layout

Physical layout of the data is concerned with:

  • the organization of files and data elements,
  • partitioning of data,
  • clustering of data,
  • file blocking factors to maximize performance and/or reduce space requirements.

Addressing Techniques

Addressing techniques for flat files and relational database tables include:

  • sequential,
  • indices,
  • inverted lists,
  • hashed.

For network and hierarchical databases, addressing techniques include such things as:

  • contiguous lists,
  • indices,
  • pointer methods,
  • bitmaps.

Compaction Techniques

Compaction techniques include:

  • the use of packed fields to reduce space requirements and improve response time,
  • compression of images according to various standards, for example, Joint Photographic Experts Group (JPEG), Moving Pictures Experts Group (MPEG), Digital Video Interactive (DVI),
  • compression of text and executables.
  • Compressed data can reduce space requirements and help performance by:
  • increasing the amount of data that can be held in main memory at one time,
  • reducing the volume of data to be transmitted over communication lines.

However, there is some overhead associated with compressing/decompressing data.  The tradeoffs must be evaluated.

Storage Devices

Storage devices are selected to define the storage hierarchy such as:

  • main memory,
  • solid state disk,
  • magnetic disk,
  • optical disk,
  • tape.

The storage device used should match the response time requirement.  Multiple devices may be used in a buffering or caching scheme to improve performance.

 

Physical Database Design Performance Review Steps

When high performance is required or high transaction volumes are involved, a formal and detailed review of the database design should be conducted to detect possible performance problems.

The transaction load of the system is analyzed and the number of physical I/Os are determined.  The I/O total is then evaluated in terms of the selected hardware configuration and hardware capacities.

Database design review involves the following steps:

  • Identify the critical transactions.  Critical transactions are identified when the system architecture is defined.  They are those transactions that have high volume or which require high performance.
  • For each critical transaction, identify system resources consumed:
  • Identify the logical I/Os.
  • Determine the physical I/Os required for each logical I/O.  For example, a write to a file with secondary keys will require an additional write for each secondary key.
  • Use projected volumes to determine the number of physical I/Os per second.
  • Determine the true number of physical I/Os per second.  Consider hardware features such as:
  • I/Os satisfied by disk cache.  For example, if the cache hit‑rate is 50 percent then the number of expected physical reads is one half the number of physical reads,
  • block splits of index or data blocks,
  • extra writes for mirrored disks (e.g., Tandem),
  • audit trails,
  • record locking/unlocking.
  • Identify System Bottlenecks.
  • Compare the hardware configuration capabilities to the expected transaction load and resulting resource utilization.
  • Propose solutions to eliminate system bottlenecks, such as:
  • Revise the database design.
  • Remove functionality that is resource intensive but of marginal benefit.
  • Move some functionality to off-hours.
  • Accept degraded performance during peak times.
  • Expand the hardware configuration.
  • Perform Technical Benchmarking if proof is required.

DATABASE DESIGN PERFORMANCE REVIEW EXAMPLE

Example: Check-out Transaction for a Public Library

The example walks through the database design review for a public library circulation system in a Tandem environment using the Enscribe relational file system.

Circulation check-out has been identified as a critical transaction.

Background: Components of the Check-out Transaction

The Check-out transaction is made up of two components: patrons and items.  The logical steps involved are:

  • Wand patron card, read patron record.
  • Wand each item, read item record.
  • Read patron activity, delete patron activity (approximately two percent of the time).
  • Read reserves, rewrite reserves (approximately two percent of the time).
  • Enqueue reserve (approximately two percent of the time).
  • Read patron, rewrite patron (approximately five percent of the time).
  • Write patron activity, rewrite item.
  • Post circulation statistics (buffered every 30 minutes).

Note:  Because steps three to six and step eight occur a very small percentage of the time, their reads and writes can be excluded from the totals.

Identify the Logical I/Os

The Critical Transaction Analysis form for Logical I/Os is used to identify the logical I/Os required to complete a single Check-out transaction.

The form lists the major transaction components.  The logical steps performed to complete the transaction are listed, with the database files involved.  Connecting the logical steps and the database files is an arrow indicating the number of logical reads, writes, and inter-process communications (IPC) required.  At the bottom of the form the reads, writes, and IPCs are totalled for each major component.

Identify Physical I/Os

The Critical Transaction Analysis form for Physical I/Os is used to identify the physical I/Os required for each critical transaction (based on the logical I/Os).

To complete this form, it is necessary to have an in-depth understanding of the file's physical makeup, the number of index levels required for the different length keys (this information should be derived from analysis), and the internal behaviour of the DBMS or file system.  This form uses the same diagram as the logical I/Os but the read, write, and IPC numbers are updated to reflect the physical I/O requests for each logical I/O issued.

In this example, the number of reads in steps three to six has increased enough to make them significant.  After calculating their percentage weight, the number of reads per transaction to perform steps three to six is approximately 0.48 (close enough to 0.5 to be rounded to 1.0).

Transaction Summary - Physical I/Os

The Transaction - I/O Summary form for Physical I/O Requests is used to total the number of physical I/Os per second (both peak and average) for each critical transaction, based on:

  • the number of physical I/Os per second,
  • projected volumes.

The numbers in the three columns indicating arrival rates are supplied by the customer.  The "per trans" columns under Reads and Writes is supplied from the totals listed as physical I/Os requested for each transaction.  The "per sec" column is calculated by multiplying "peak per sec" by "per trans."

Transaction Summary - Actual Physical I/Os

The Transaction - I/O Summary form for Actual Physical I/Os is used to identify the "actual" or "true" physical I/O's for each critical transaction based on:

  • the number of physical I/Os requested for each transaction,
  • hardware and operating system feature, such as:
    • cache memory,
    • buffered writes,
    • mirrored database files,
    • audit trails,
    • mirrored audit trails.

The example used is a subset of the transactions for the Circulation system.  In this Tandem environment, both the database files and audit trails are mirrored.  The example also makes an assumption of a 50 percent cache hit rate for reads and writes.

Using these assumptions, the calculations used are:

reads =    number of physical I/Os requested / 2 (assuming 50 percent cache hits)

writes =    number of physical I/Os requested * 2 (for mirrored database files) 
                                           * 2 (mirrored audit trails)
                                           / 2 (50 percent cache hits)

writes (net result of above) =
                                           number of physical I/Os requested * 2

These actual physical I/O totals can now be overlaid on the hardware configuration to check for bottlenecks.

No comments: