Saturday, February 9, 2008

Relational database management system

A Relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd. Most popular commercial and open source databases currently in use are based on the relational model.
A short definition of an RDBMS may be a DBMS in which data is stored in the form of tables and the relationship among the data is also stored in the form of tables.

History of the term

E. F. Codd introduced the term in his seminal paper "A Relational Model of Data for Large Shared Data Banks", published in 1970. In this paper and later papers he defined what he meant by relational. One well-known definition of what constitutes a relational database system is Codd's 12 rules. However, many of the early implementations of the relational model did not conform to all of Codd's rules, so the term gradually came to describe a broader class of database systems. At a minimum, these systems:

  • presented the data to the user as relations (a presentation in tabular form, i.e. as a collection of tables with each table consisting of a set of rows and columns, can satisfy this property)
  • provided relational operators to manipulate the data in tabular form

The first systems that were relatively faithful implementations of the relational model were from the University of Michigan; Micro DBMS (1969) and from IBM UK Scientific Centre at Peterlee; IS1 (1970–72) and its followon PRTV (1973–79). The first system sold as an RDBMS was Multics Relational Data Store, first sold in 1978. Others have been Berkeley Ingres QUEL and IBM BS12.
The most popular definition of an RDBMS is a product that presents a view of data as a collection of rows and columns, even if it is not based strictly upon relational theory. By this definition, RDBMS products typically implement some but not all of Codd's 12 rules.
A second, theory-based school of thought argues that if a database does not implement all of Codd's rules (or the current understanding on the relational model, as expressed by Christopher J Date, Hugh Darwen and others), it is not relational. This view, shared by many theorists and other strict adherents to Codd's principles, would disqualify most DBMSs as not relational. For clarification, they often refer to some RDBMSs as Truly-Relational Database Management Systems (TRDBMS), naming others Pseudo-Relational Database Management Systems (PRDBMS).
Almost all commercial relational DBMSes employ SQL as their query language. Alternative query languages have been proposed and implemented, but very few have become commercial products.

List of relational database management systems

Open-source software

Freeware (proprietary)

Proprietary software not available without cost

Historical

Truly relational

 Current

  • Alphora Dataphor (a proprietary virtual, federated DBMS and RAD MS .Net IDE).
  • CsiDB (proprietary C++ library).
  • D flat (unreleased MS .Net academic project).
  • Duro (free C library).
  • Opus (free C libraries).
  • Rel (free Java implementation).
  • Rosetta (free Perl implementation).
  • Aldat (implemented in Relix and JRelix): A pure relational DBMS implemented in McGill University. It features clear syntax and domain algebra operators.

Obsolete

Friday, February 8, 2008

An Overview of Database Management Systems

An Overview of Database Management Systems

There are two approaches to data storage:

  • file-based,
  • database.

File-Based Approaches

File-based approaches to data storage are based on relatively simple data structures, such as the Indexed Sequential Access Method (ISAM), and are usually implemented for a single application.

Files are generally created on an as needed basis to service the data needs of an application. The files are associated with an application. The same data may be repeated on many files and stored under different names. For example, an accounting application may refer to customer name while a purchasing application may refer to buyer name. The physical storage characteristics of the same data may be different for different applications. For example, one application may allow 20 characters for name while another application allows 25 characters for the same name. Different business units are responsible for different data.

Disadvantages of a File-Based Approach

Updates to files may result in inconsistent data across the organization. For example, if an accounting application updates a customer name without notifying other application areas that also maintain customer name, customer name will be stored differently for different applications.

A file-based data storage approach makes it difficult for other applications to access data not owned by their application. Data owned by other applications may be stored in a format not consistent with the retrieval capabilities of another application.

File-based approaches to data storage are tied to applications rather than the entities to which the files refer. File-based approaches do not recognize relationships between entities until such information is needed by an application.

Database Approaches

Database approaches to data storage support the sharing of data across multiple applications with multiple users. Databases are structured in a way that is meaningful to an organization. For example, if an organization maintains information on suppliers and the geographic areas they service, there would be a link in the database between the suppliers and geographic areas. Databases reduce data redundancy.

A Database Management System (DBMS) is the software that handles all database accesses. A DBMS presents a logical view of the data to the users. How this data is stored and retrieved is hidden from the users. A DBMS ensures that the data is consistent across the database and controls who can access what data.

Categories of Database Management Systems

The main categories of DBMSs are:

HIERARCHICAL DATABASES

Description of Hierarchical Database

In the hierarchical structure, data is represented by a simple tree structure. The record type at the top of the tree is usually known as the "root." The simple hierarchical structure consists of a root and a single dependent record type. In general, the root may have any number of dependent records, each of which may have any number of lower-level dependents, and so on, to any number of levels.

The hierarchy view contains records of different types connected by links.

Hierarchical relationships of records are explicitly defined in the data structure. A parent record can have many child records but a child record can have only one parent. There are no many-to-many relationships between records. No dependent record within a hierarchical data structure can exist without its parent record. For this reason, records must be seen in context.

Examples of hierarchical DBMSs include IBM's Information Management System (IMS) and System 2000.

Strengths of Hierarchical Databases

The advantages of a hierarchical database are:

· efficient representation of hierarchical structures,

· efficient single key search and access time (if the hierarchical structure corresponds to application views of the data),

· fast update performance where locality of reference exists (locality of reference states that performance is significantly enhanced when the processing is close to the data being processed).

Weaknesses of Hierarchical Databases

The disadvantages of a hierarchical database are:

· lack of flexibility (non-hierarchical relationships are awkward to represent; redundancy may be required),

· poor performance for non-hierarchical accesses,

· lack of maintainability (changing relationships may require physical reorganization of data).

NETWORK DATABASES

Description of Network Database

In a network database structure, data is represented by records and links. There is no limit to the number of immediate parent segments or the number of immediate child segments a given record occurrence may have.

The network database structure allows the modelling of many-to-many relationships.

DMS and IDMS are examples of network databases.

Strengths of Network Databases

The advantages of a network database are:

· efficient representation of some structures (varies widely with physical implementation),

· more flexibility than a hierarchical approach (all relationships can be represented without redundancy).

Weaknesses of Network Databases

The disadvantages of a network database are:

· machine performance (physical implementations that perform well for one type of network may perform poorly for another type),

· maintainability (changing relationships may require physical reorganization of data),

· lack of robustness. A failure in the system can leave dangling references to data which must somehow be recovered,

· update overheads in multi‑key databases.

RELATIONAL DATABASES

The relational model for databases provides the basic DBMS characteristics. In addition, an RDBMS also conforms to Codd's model.

Relational Database Characteristics

Dr. Codd established 12 rules to which a DBMS must conform to be considered relational. DBMSs vary in the way in which they comply with these rules, however, commercial relational databases generally conform to these rules.

Strengths of RDBMS

Flexible and well-established.

Sound theoretical foundation and use over many years has resulted in stable, standardized products available.

Standard data access language through SQL.

Costs and risks associated with large development efforts and with large databases are well understood.

The fundamental structure, i.e., a table, is easily understood and the design and normalization process is well defined.

Weaknesses of RDBMS

Performance problems associated with re-assembling simple data structures into their more complicated real-world representations.

Lack of support for complex base types, e.g., drawings.

SQL is limited when accessing complex data.

Knowledge of the database structure is required to create ad hoc queries.

Locking mechanisms defined by RDBMSs do not allow design transactions to be supported, e.g., the "check in" and "check out" type of feature that would allow an engineer to modify a drawing over the course of several working days.

OBJECT DATABASES

With the increased awareness of the object-oriented paradigm, many DBMSs claim to be object-oriented databases.

The object-oriented model provides the basic DBMS characteristics. In addition Object-oriented Database Management Systems provide the following features.

Mandatory

· Support for Complex Objects

An ODBMS must provide support for data types that allow storage of complex objects. An RDBMS supports only basic types, (e.g., integer, character), that are assembled into table rows to define objects.

· Identity

An object must be uniquely identified.

· Encapsulation

Hiding information and only allowing access through the public interface is a basic concept in object-orientation. Strict implementations enforce access to data through programs, however, some ODBMS products allow query access to the data.

· Classes and Types

An ODBMS must allow definition of classes and data types by the user. In addition these must not be restricted by the implementation. User defined types must be treated the same as the product supplier's base types in the database functionality.

· Inheritance

Inheritance is a basic concept in object-orientation that is a major factor in re-use. Inheritance of data and methods is a primary feature of an object-oriented programming language (OOPL).

· Dynamic Binding and Polymorphism

This characteristic allows different implementations of semantically similar methods with the same name in different classes.

· Computational Completeness

The integrated language must be computationally complete.

For example, SQL is relationally complete since a SELECT statement can be constructed to retrieve any subset of data within an RDBMS. SQL is not computationally complete (it is sometimes called a data sublanguage) since there are computations that cannot be done using SQL constructs. The ODBMS language cannot have this restriction.

· Extensibility

The user must be able to add non-restricted types to the database.

Strengths of ODBMS

Representations of complex structures allow creation of a model that more closely resembles its real world counterpart.

Storage of complex objects results in better performance.

The model exhibits better coupling and cohesion characteristics. This results in a more maintainable and flexible database.

Weaknesses of ODBMS

Lack of maturity in the products. This leads to more risk in employing the product. There is also a lack of product standardization. Database standards are emerging.

There is no equivalent of SQL for object-oriented databases. Work is being done on establishing standards.

There is much less experience with project development. Hence, there is a lack of information on costs and schedules. In addition, there is added training required for the team because of the lack of familiarity with the object paradigm.

There is no consistent, solid theoretical basis to support ODBMS products. This raises questions about the completeness of the implementations. In addition, there is a plethora of analysis and design approaches.

Selecting a Database

Database technology may not always be needed, depending on the sophistication of the system's data structures, complexity of data, and the need for features provided by databases.

In selecting a database, it is important to consider, in addition to the actual data management capabilities, the extent of the development environment provided or supported. Since the physical view of data can be separated from the application view, the development environment provided or supported will have a significant effect on the ability to generate new applications or maintain existing applications using this data. Application engineers and developers do not need to know about the storage structure or access strategy of data to develop or modify applications.

Support for a standard query language, such as SQL, is also an important criteria. By developing applications using a standard query language, flexibility is achieved since the underlying database product can be changed or modified without necessarily affecting the applications that access the data. Note, however, that some DBMSs have added functionality not supported by the standard query language. Use the added functionality with caution.

There are also many other criteria to consider when selecting a database which I will cover in later posts.

Thursday, February 7, 2008

Database Normalization Overview

Database Normalization Overview

Use Normalization to reduce redundancy in your logical data model.

Method

Apply the rules of Normalization while conducting Entity-Relationship Modelling. The basic steps in the normalization process result in data structures which are in:

  1. First Normal Form,
  2. Second Normal Form,
  3. Third Normal Form.

These three normal forms are designed to prevent update anomalies and data inconsistencies. First normal form is concerned with the shape of the record and the preliminary identification of keys. Second normal form and third normal form are concerned with resolving problems associated with the interrelationship of keys and dependent attributes.

Other normal forms include Boyce-Codd, fourth, and fifth. Boyce-Codd normal form eliminates redundancy that can sometimes occur even in third normal form tables. Fourth normal form eliminates multi-dependence, which occurs when there are two independent many-to-many relationships. Fifth normal form eliminates cyclic dependency, which involves a three-way relationship among three attributes. In practice, fourth normal form occurs intuitively as part of the exercise to produce third normal form. Boyce-Codd normal form and fifth normal form are difficult to use and rarely needed.

Wednesday, February 6, 2008

Database Normalization: First Normal Form

Database Normalization: First Normal Form

Data is in first normal form when organized into relations (records) that have no repeating groups of data items.

Description of a Relation

A relationis a flat file, or a two dimensional table of data elements, where each row is equivalent to a logical record, and each column contains the specific values for an attribute of the relation.  A relation should only contain records of identical format.  Relations are often referred to as records when describing normalization.
 
Note:  There is no connection between the terms "relation" and "relationship."

First Rule of Normalization

A relation may not contain any repeating groups of attributes.

Second Rule of Normalization

Internal structures within data elements of a relation must be split into separate attributes.

Applying the Rules to Achieve First Normal Form

Under first normal form, all occurrences of a record type must contain the same number of fields.  First normal form includes derived attributes but excludes variable repeating fields and groups.

To normalize data into first normal form, apply the first two rules by:

·        moving attributes which repeat multiple times for an entity to a separate entity.  Ensure attributes do not repeat in the new entity,

·        splitting internal structures into multiple attributes.

Most programming languages allow programmers to create records that are not flat, i.e., contain repeating groups.  For example, the "OCCURS" statement in COBOL allows repeating groups.  Records that contain repeating groups do not allow a specific occurrence to be uniquely identified by the key of the complete record.  In general, a non-flat record is normalized by converting it into two or more flat records.
 
For example, to place this record in first normal form, the repeating group is removed and a separate record is created.

Tuesday, February 5, 2008

Database Normalization: Second Normal Form

Database Normalization: Second Normal Form

Second normal form is achieved when each attribute in a record is functionally dependent on the entire key of that record.

Candidate Keys

The key of a normalized record must have the following characteristics:

·        It must uniquely identify the record for each occurrence of the record.

·        It must not contain redundant data, i.e., no attribute in the key can be eliminated without destroying the unique identification of the key.

Sometimes, more than one attribute or group of attributes could be the key of a record.  Alternate choices are called candidate keys.  Only one candidate key can be designated the primary key.
 
With the introduction of candidate keys, a more comprehensive definition of second normal form is as follows:

A record is in second normal form if it is in first normal form and every non-key attribute of the record is fully functionally dependent on each candidate key for the record.

Description of Functional Dependency

Functional Dependency is defined as:

Attribute X of a relation is functionally dependent on attribute Y (the key) of that relation if attribute X has only one value associated with a specific value for attribute Y.

In other words, if the value of attribute Y, the key, is known, the associated value of attribute X can be determined.  For example, in an employee record, the EMPLOYEE START DATE is functionally dependent on EMPLOYEE NUMBER.  For each EMPLOYEE NUMBER there is only one EMPLOYEE START DATE.  To access the EMPLOYEE START DATE, you need to know the EMPLOYEE NUMBER, the key.

An attribute can be functionally dependent on a group of attributes rather than only on an attribute.
 

Full Functional Dependency

If the key of a relation is a concatenation of attributes, attributes are said to be fully functionally dependent on the key if they are functionally dependent on the whole key, and not on a subset of it.

Third Rule of Normalization

Each attribute in a relation must be dependent on the entire key that uniquely identifies the relation and not on a partial key.

Applying the Rules to Achieve Second Normal Form

 
To normalize data into second normal form, apply the third rule by:

·        removing any attributes which are dependent on part of the key only,

·        forming a new relation for that attribute, identified by the specific part of the original key.

In general, to achieve second normal form, every attribute in a record should be dependent on the entire key.  If an attribute is not dependent on the entire key, the attribute is split off to form a separate relation.

Problem with a Record Not in Second Normal Form

The basic problems with the design of the example Inventory Record are:

·        The warehouse address is repeated in every record that refers to a part stored in that warehouse.

·        If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.

·        Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.

·        If, at some point in time, there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.

Monday, February 4, 2008

Database Normalization: Third Normal Form

Database Normalization: Third Normal Form

A record is in third normal form if:

·        it is in second normal form and,

·        every non-key attribute of the record is non-transitively dependent on the primary key of the record.

Transitive Dependency

A record in second normal form can still have an update anomaly.  It may contain an attribute that is not a key but does identify other attributes.

An attribute X within a relation may be dependent on a non-key attribute Y which in turn is dependent on the key.  Unless the reverse is true, then the attribute X is dependent on the key only through attribute Y.  Without Y, the dependence is lost.  This state is referred to as transitive dependency.
 
Removing transitive dependencies puts data in third normal form.

Fourth Rule of Normalization

Non-key attributes of a relation must be fully dependent on the key and must be independent of any other non-key attribute.

Applying the Rules to Achieve Third Normal Form

To normalize data to achieve third normal form, apply the fourth rule of normalization:

·        Identify and remove any attribute which is actually dependent on a non-key attribute in the same relation (i.e., dependent upon the key through another attribute).

·        Create a new relation for this attribute identified by the relevant key.

Attributes are now non-transitively dependent upon the key.

Third normal form is violated when a non-key field is fully functionally dependent on another non-key field.

To satisfy third normal form, the example Employee Record should be decomposed.
 
The problems with a record not in third normal form are the same as those caused by violations of second normal form:

·        The location of the department is repeated in the record of each employee assigned to that department.

·        If the location of the department changes, every such record must be updated.

·        Because of the redundancy, the data may become inconsistent, with different records showing different locations for the same department.

·        If a department has no employees, there may be no record in which to keep the department's location.

In general, a record is in third normal form if it is in second normal form and if every field is either part of the key or provides a (single-valued) fact about the whole key and nothing else.  If an entity is in third normal form, the non-key fields are said to be non-transitively dependent upon the key.

Sunday, February 3, 2008

Database Normalization: The Benefits

Database Normalization: The Benefits

Normalization:

  • serves as the basis for Physical Database Design, ensuring that customer requirements are properly satisfied and that new requirements are easier to accommodate,
  • makes the design of a modular system easier,
  • makes the database easier to maintain,
  • ensures structural stability of data,
  • prevents various updating anomalies which can occur in non-normalized record structures,
  • enables record processing by a set of simple operators.

While performance considerations often prohibit the direct implementation of normalized records, the process of normalization  results in a model of the "ideal" data structure and relationships.  Consequently, it is a valuable procedure whether or not a relational database system is used.

Saturday, February 2, 2008

Database Normalization: Performance & Storage Trade-offs

Database Normalization: Performance & Storage Trade-offs

Normal form (reduced redundancy) tends to penalize efficient retrieval of data, since data which may have been retrievable from one table in an unnormalized design may have to be retrieved from several tables in the normalized form.  Normal form, however, usually improves the performance of updates and deletions.

Although normalization results in more tables, less data storage is typically required because non-normalized data has more value redundancy.

For example, moving the attribute WAREHOUSE IDENTIFICATION CODE from part records means that a separate record is needed for each separate WAREHOUSE.  However, data storage is reduced because each part record does not contain a value for WAREHOUSE IDENTIFICATION CODE (assuming that the additional storage overhead of the additional tables is insignificant).

Normalization enhances the integrity of the data by minimizing data redundancy and inconsistency.  However, there may be additional performance costs for retrieval of data in certain applications.  As a result, the physical database design may deviate from the normalized form due to performance considerations.  Consider an application that wants the addresses of all warehouses that stock a certain part.  In the unnormalized form, the application searches for one record type.  With the normalized design, the application has to search for and join two record types.