What is PostgreSQL?

PostgreSQL is an object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department. The POSTGRES project, led by Professor Michael Stonebraker, was sponsored by the Defense Advanced Research Projects Agency (DARPA), the Army Research Office (ARO), the National Science Foundation (NSF), and ESL, Inc.

PostgreSQL is an open-source descendant of this original Berkeley code. It provides SQL92/SQL99 language support and other modern features.

POSTGRES pioneered many of the object-relational concepts now becoming available in some commercial databases. Traditional relational database management systems (RDBMS) support a data model consisting of a collection of named relations, containing attributes of a specific type. In current commercial systems, possible types include floating point numbers, integers, character strings, money, and dates. It is commonly recognized that this model is inadequate for future data processing applications. The relational model successfully replaced previous models in part because of its "Spartan simplicity". However, this same simplicity often makes the implementation of certain applications very difficult. Postgres offers substantial additional power by incorporating the following concepts in such a way that users can easily extend the system:

inheritance
data types
functions

Other features provide additional power and flexibility:

constraints
triggers
rules
transaction integrity

These features put Postgres into the category of databases referred to as object-relational. Note that this is distinct from those referred to as object-oriented, which in general are not as well suited to supporting the traditional relational database languages. So, although Postgres has some object-oriented features, it is firmly in the relational database world. In fact, some commercial databases have recently incorporated features pioneered by Postgres.

A Short History of Postgres

The object-relational database management system now known as PostgreSQL (and briefly called Postgres95) is derived from the Postgres package written at the University of California at Berkeley. With over a decade of development behind it, PostgreSQL is the most advanced open-source database available anywhere, offering multi-version concurrency control, supporting almost all SQL constructs (including subselects, transactions, and user-defined types and functions), and having a wide range of language bindings available (including C, C++, Java, Perl, Tcl, and Python).

The Berkeley Postgres Project

Implementation of the Postgres DBMS began in 1986. The initial concepts for the system were presented in The Design of Postgres, and the definition of the initial data model appeared in The Postgres Data Model. The design of the rule system at that time was described in The Design of the Postgres Rules System. The rationale and architecture of the storage manager were detailed in The Postgres Storage System.

Postgres has undergone several major releases since then. The first "demoware" system became operational in 1987 and was shown at the 1988 ACM-SIGMOD Conference. We released Version 1, described in The Implementation of Postgres, to a few external users in June 1989. In response to a critique of the first rule system (A Commentary on the Postgres Rules System), the rule system was redesigned (On Rules, Procedures, Caching and Views in Database Systems) and Version 2 was released in June 1990 with the new rule system. Version 3 appeared in 1991 and added support for multiple storage managers, an improved query executor, and a rewritten rewrite rule system. For the most part, releases until Postgres95 (see below) focused on portability and reliability.

Postgres has been used to implement many different research and production applications. These include: a financial data analysis system, a jet engine performance monitoring package, an asteroid tracking database, a medical information database, and several geographic information systems. Postgres has also been used as an educational tool at several universities. Finally, Illustra Information Technologies (since merged into Informix) picked up the code and commercialized it. Postgres became the primary data manager for the Sequoia 2000 scientific computing project in late 1992.

The size of the external user community nearly doubled during 1993. It became increasingly obvious that maintenance of the prototype code and support was taking up large amounts of time that should have been devoted to database research. In an effort to reduce this support burden, the project officially ended with Version 4.2.

Postgres95

In 1994, Andrew Yu and Jolly Chen added a SQL language interpreter to Postgres. Postgres95 was subsequently released to the Web to find its own way in the world as an open-source descendant of the original Postgres Berkeley code.

Postgres95 code was completely ANSI C and trimmed in size by 25%. Many internal changes improved performance and maintainability. Postgres95 v1.0.x ran about 30-50% faster on the Wisconsin Benchmark compared to Postgres v4.2.

The query language Postquel was replaced with SQL (implemented in the server).
Support for the GROUP BY query clause was also added. The libpq interface remained available for C programs.
In addition to the monitor program, a new program (psql) was provided for interactive SQL queries.
The large object interface was overhauled.
A short tutorial introducing regular SQL features as well as those of Postgres95 was distributed with the source code.
Postgres95 could be compiled with gcc.

PostgreSQL

By 1996, it became clear that the name "Postgres95" would not stand the test of time. We chose a new name, PostgreSQL, to reflect the relationship between the original Postgres and the more recent versions with SQL capability. At the same time, we set the version numbering to start at 6.0, putting the numbers back into the sequence originally begun by the Postgres Project.

The emphasis during development of Postgres95 was on identifying and understanding existing problems in the backend code. With PostgreSQL, the emphasis has shifted to augmenting features and capabilities, although work continues in all areas.

Major enhancements in PostgreSQL include:

Table-level locking has been replaced with multi-version concurrency control, which allows readers to continue reading consistent data during writer activity and enables hot backups from pg_dump while the database stays available for queries.
Important backend features, including subselects, defaults, constraints, and triggers, have been implemented.
Additional SQL92-compliant language features have been added, including primary keys, quoted identifiers, literal string type coercion, type casting, and binary and hexadecimal integer input.
Built-in types have been improved, including new wide-range date/time types and additional geometric type support.
Overall backend code speed has been increased by approximately 20-40%, and backend start-up time has decreased 80% since version 6.0 was released.

SQL

SQL has become the most popular relational query language. The name "SQL" is an abbreviation for Structured Query Language. In 1974 Donald Chamberlin and others defined the language SEQUEL (Structured English Query Language) at IBM Research. This language was first implemented in an IBM prototype called SEQUEL-XRM in 1974-75. In 1976-77 a revised version of SEQUEL called SEQUEL/2 was defined, and the name was subsequently changed to SQL.

A new prototype called System R was developed by IBM in 1977. System R implemented a large subset of SEQUEL/2 (now SQL), and a number of changes were made to SQL during the project. System R was installed at a number of user sites, both internal IBM sites and some selected customer sites. Thanks to the success and acceptance of System R at those user sites, IBM started to develop commercial products that implemented the SQL language based on the System R technology.

Over the following years, IBM and a number of other vendors announced SQL products such as SQL/DS (IBM), DB2 (IBM), ORACLE (Oracle Corp.), DG/SQL (Data General Corp.), and SYBASE (Sybase Inc.).

SQL is also an official standard now. In 1982 the American National Standards Institute (ANSI) chartered its Database Committee X3H2 to develop a proposal for a standard relational language. This proposal was ratified in 1986 and consisted essentially of the IBM dialect of SQL. In 1987 this ANSI standard was also accepted as an international standard by the International Organization for Standardization (ISO). This original standard version of SQL is often referred to, informally, as "SQL/86". In 1989 the original standard was extended and this new standard is often, again informally, referred to as "SQL/89". Also in 1989, a related standard called Database Language Embedded SQL (ESQL) was developed.

The ISO and ANSI committees have been working for many years on the definition of a greatly expanded version of the original standard, referred to informally as SQL2 or SQL/92. This version became a ratified standard - "International Standard ISO/IEC 9075:1992, Database Language SQL" - in late 1992. SQL/92 is the version normally meant when people refer to "the SQL standard". A detailed description of SQL/92 is given in Date and Darwen, 1997. At the time of writing this document a new standard informally referred to as SQL3 is under development. It is planned to make SQL a Turing-complete language, i.e. all computable queries (e.g. recursive queries) will be possible. This is a very complex task and therefore the completion of the new standard cannot be expected before 1999.

The Relational Data Model

As mentioned before, SQL is a relational language. That means it is based on the relational data model first published by E.F. Codd in 1970. We will give a formal description of the relational model later (in Relational Data Model Formalities) but first we want to have a look at it from a more intuitive point of view.

A relational database is a database that is perceived by its users as a collection of tables (and nothing else but tables). A table consists of rows and columns where each row represents a record and each column represents an attribute of the records contained in the table. The Suppliers and Parts Database shows an example of a database consisting of three tables:

SUPPLIER is a table storing the number (SNO), the name (SNAME) and the city (CITY) of a supplier.
PART is a table storing the number (PNO), the name (PNAME) and the price (PRICE) of a part.
SELLS stores information about which part (PNO) is sold by which supplier (SNO). It serves in a sense to connect the other two tables together.

Example 1-1. The Suppliers and Parts Database

SUPPLIER:
 SNO |  SNAME  |  CITY
-----+---------+--------
  1  |  Smith  | London
  2  |  Jones  | Paris
  3  |  Adams  | Vienna
  4  |  Blake  | Rome

PART:
 PNO |  PNAME  |  PRICE
-----+---------+---------
  1  |  Screw  |   10
  2  |  Nut    |    8
  3  |  Bolt   |   15
  4  |  Cam    |   25

SELLS:
 SNO | PNO
-----+-----
  1  |  1
  1  |  2
  2  |  4
  3  |  1
  3  |  3
  4  |  2
  4  |  3
  4  |  4

The tables PART and SUPPLIER may be regarded as entities and SELLS may be regarded as a relationship between a particular part and a particular supplier.
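
The example database can also be written down as plain data, which makes the role of SELLS as the connecting table easy to try out. The following is only an illustrative sketch in Python: modelling each table as a set of tuples, and the helper name parts_sold_by, are assumptions made for this example and are not part of Postgres or SQL.

```python
# Each table is modelled as a set of tuples; the column names are kept
# separately. This representation is an assumption made for illustration.
SUPPLIER_COLUMNS = ("SNO", "SNAME", "CITY")
SUPPLIER = {
    (1, "Smith", "London"),
    (2, "Jones", "Paris"),
    (3, "Adams", "Vienna"),
    (4, "Blake", "Rome"),
}

PART_COLUMNS = ("PNO", "PNAME", "PRICE")
PART = {
    (1, "Screw", 10),
    (2, "Nut", 8),
    (3, "Bolt", 15),
    (4, "Cam", 25),
}

SELLS_COLUMNS = ("SNO", "PNO")
SELLS = {(1, 1), (1, 2), (2, 4), (3, 1), (3, 3), (4, 2), (4, 3), (4, 4)}


def parts_sold_by(sno):
    """Return the sorted names of the parts sold by supplier number `sno`.

    Hypothetical helper: it follows SELLS from a supplier to its parts.
    """
    part_names = {pno: pname for pno, pname, _price in PART}
    return sorted(part_names[pno] for s, pno in SELLS if s == sno)


print(parts_sold_by(1))  # the parts sold by supplier 1 (Smith)
```

SELLS holds nothing but the two key values, so every lookup from a supplier to a part has to pass through it.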

As we will see later, SQL operates on tables like the ones just defined, but before that we will study the theory of the relational model.

Relational Data Model Formalities

The mathematical concept underlying the relational model is the relation. The relation gives the model its name (do not confuse it with the relationship from the Entity-Relationship model). Formally a domain is simply a set of values. For example the set of integers is a domain. Also the set of character strings of length 20 and the real numbers are examples of domains.

A Relation is any subset of the Cartesian product of one or more domains: R ⊆ D1 × D2 × ... × Dk.
The members of a relation are called tuples. A relation can be viewed as a table (as we already did; recall The Suppliers and Parts Database), where every tuple is represented by a row and every column corresponds to one component of a tuple. Giving names (called attributes) to the columns leads to the definition of a relation scheme. A relation scheme R is a finite set of attributes A1, A2, ... Ak. For each attribute Ai, 1 <= i <= k, there is a domain Di from which the values of the attribute are taken. We often write a relation scheme as R(A1, A2, ... Ak).
Note: A relation scheme is just a kind of template whereas a relation is an instance of a relation scheme. The relation consists of tuples (and can therefore be viewed as a table); not so the relation scheme.
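
The definitions above can be checked on a very small case. The sketch below assumes two tiny finite domains (real domains such as the integers are infinite) and uses a hypothetical helper, is_relation_over, to test whether a set of tuples is a subset of the Cartesian product of the given domains.

```python
from itertools import product

# Two small, finite domains chosen only for illustration.
D1 = {1, 2}                # supplier numbers
D2 = {"London", "Paris"}   # cities

# The Cartesian product D1 x D2 is the set of all possible 2-tuples.
cartesian = set(product(D1, D2))

# A relation is any subset of that product; its members are the tuples.
relation = {(1, "London"), (2, "Paris")}

# A relation scheme only names the columns; it contains no tuples itself.
scheme = ("SNO", "CITY")


def is_relation_over(candidate, *domains):
    """True if every tuple of `candidate` lies in the product of `domains`."""
    return set(candidate) <= set(product(*domains))


print(len(cartesian))                       # number of possible tuples
print(is_relation_over(relation, D1, D2))   # the subset test
```

The scheme stays the same however many tuples the relation holds, which is the sense in which it acts as a template.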

Domains vs. Data Types

We often talked about domains in the last section. Recall that a domain is, formally, just a set of values (e.g., the set of integers or the real numbers). In terms of database systems we often talk of data types instead of domains. When we define a table we have to make a decision about which attributes to include. Additionally we have to decide which kind of data is going to be stored as attribute values. For example the values of SNAME from the table SUPPLIER will be character strings, whereas SNO will store integers. We define this by assigning a data type to each attribute. The type of SNAME will be VARCHAR(20) (this is the SQL type for character strings of length <= 20), the type of SNO will be INTEGER. With the assignment of a data type we also have selected a domain for an attribute. The domain of SNAME is the set of all character strings of length <= 20, the domain of SNO is the set of all integer numbers.
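
One way to see a data type as a domain is to write it as a membership test. The sketch below is an assumption-laden illustration rather than how any database implements types: the function names are hypothetical, and the INTEGER check ignores the finite range that a real system imposes.

```python
def in_varchar(value, max_length):
    """Domain of VARCHAR(n): character strings of length <= n."""
    return isinstance(value, str) and len(value) <= max_length


def in_integer(value):
    """Domain of INTEGER, ignoring the finite range of a real system."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# The data types assigned in the text to two attributes of SUPPLIER.
SUPPLIER_TYPES = {
    "SNO": in_integer,
    "SNAME": lambda value: in_varchar(value, 20),
}


def conforms(row):
    """True if every typed attribute of `row` lies in its domain."""
    return all(check(row[attr]) for attr, check in SUPPLIER_TYPES.items())


print(conforms({"SNO": 1, "SNAME": "Smith"}))      # values inside both domains
print(conforms({"SNO": "one", "SNAME": "Smith"}))  # SNO outside INTEGER
```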

Relational Algebra

The Relational Algebra was introduced by E. F. Codd in 1972. It consists of a set of operations on relations:

SELECT (σ): extracts tuples from a relation that satisfy a given restriction. Let R be a table that contains an attribute A. σ_{A=a}(R) = {t ∈ R ∣ t(A) = a} where t denotes a tuple of R and t(A) denotes the value of attribute A of tuple t.
PROJECT (π): extracts specified attributes (columns) from a relation. Let R be a relation that contains an attribute X. π_{X}(R) = {t(X) ∣ t ∈ R}, where t(X) denotes the value of attribute X of tuple t.
PRODUCT (×): builds the Cartesian product of two relations. Let R be a table with arity k1 and let S be a table with arity k2. R × S is the set of all (k1 + k2)-tuples whose first k1 components form a tuple in R and whose last k2 components form a tuple in S.
UNION (∪): builds the set-theoretic union of two tables. Given the tables R and S (both must have the same arity), the union R ∪ S is the set of tuples that are in R or S or both.
INTERSECT (∩): builds the set-theoretic intersection of two tables. Given the tables R and S, R ∩ S is the set of tuples that are in R and in S. We again require that R and S have the same arity.
DIFFERENCE (− or ∖): builds the set difference of two tables. Let R and S again be two tables with the same arity. R − S is the set of tuples in R but not in S.
JOIN (⋈): connects two tables by their common attributes. Let R be a table with the attributes A, B and C and let S be a table with the attributes C, D and E. There is one attribute common to both relations, the attribute C. R ⋈ S = π_{R.A,R.B,R.C,S.D,S.E}(σ_{R.C=S.C}(R × S)). What are we doing here? We first calculate the Cartesian product R × S. Then we select those tuples whose values for the common attribute C are equal (σ_{R.C=S.C}). Now we have a table that contains the attribute C two times and we correct this by projecting out the duplicate column.
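
The operations listed above can be sketched in a few lines of Python. This is an illustrative toy, not how Postgres evaluates queries: representing a relation as a pair (attributes, rows), qualifying attribute names with the table name, and all function names are assumptions made for this example. The join is built exactly as described in the text, as a product followed by a selection and a projection.

```python
# A relation is modelled as (attributes, rows): a tuple of attribute names
# and a set of value tuples of the same length. This layout is an assumption.

def select(rel, predicate):
    """SELECT: keep the tuples for which predicate(row_as_dict) is true."""
    attrs, rows = rel
    return attrs, {r for r in rows if predicate(dict(zip(attrs, r)))}


def project(rel, wanted):
    """PROJECT: keep only the named attributes; duplicate tuples collapse."""
    attrs, rows = rel
    positions = [attrs.index(a) for a in wanted]
    return tuple(wanted), {tuple(r[i] for i in positions) for r in rows}


def product(r, s):
    """PRODUCT: every tuple of r concatenated with every tuple of s."""
    return r[0] + s[0], {x + y for x in r[1] for y in s[1]}


def _check_same_arity(r, s):
    if len(r[0]) != len(s[0]):
        raise ValueError("relations must have the same arity")


def union(r, s):
    _check_same_arity(r, s)
    return r[0], r[1] | s[1]


def intersect(r, s):
    _check_same_arity(r, s)
    return r[0], r[1] & s[1]


def difference(r, s):
    _check_same_arity(r, s)
    return r[0], r[1] - s[1]


def join(r, s, r_attr, s_attr):
    """JOIN on one common attribute: product, then select, then project."""
    combined = select(product(r, s), lambda t: t[r_attr] == t[s_attr])
    keep = [a for a in combined[0] if a != s_attr]  # drop the duplicate column
    return project(combined, keep)


SUPPLIER = (
    ("SUPPLIER.SNO", "SUPPLIER.SNAME", "SUPPLIER.CITY"),
    {(1, "Smith", "London"), (2, "Jones", "Paris"),
     (3, "Adams", "Vienna"), (4, "Blake", "Rome")},
)
SELLS = (
    ("SELLS.SNO", "SELLS.PNO"),
    {(1, 1), (1, 2), (2, 4), (3, 1), (3, 3), (4, 2), (4, 3), (4, 4)},
)

joined = join(SUPPLIER, SELLS, "SUPPLIER.SNO", "SELLS.SNO")
print(joined[0])        # attribute names of the joined relation
print(len(joined[1]))   # one tuple per row of SELLS
```

Because rows are kept in a Python set, duplicate tuples disappear automatically, which matches the set-based definitions used in this section.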

It is sometimes said that languages based on the relational calculus are "higher level" or "more declarative" than languages based on relational algebra because the algebra (partially) specifies the order of operations while the calculus leaves it to a compiler or interpreter to determine the most efficient order of evaluation.
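
The contrast can be illustrated with one query written in both styles: the names of all suppliers that sell part number 2. The sketch below is only an analogy in Python, under the assumption that tables are sets of tuples. Python itself still evaluates the second form in a fixed order, so the example shows the difference in how the query is stated, not in how a database would run it.

```python
SUPPLIER = {
    (1, "Smith", "London"),
    (2, "Jones", "Paris"),
    (3, "Adams", "Vienna"),
    (4, "Blake", "Rome"),
}
SELLS = {(1, 1), (1, 2), (2, 4), (3, 1), (3, 3), (4, 2), (4, 3), (4, 4)}

# Algebra style: an explicit sequence of operations, one after the other.
# Combined tuples have the layout (SNO, SNAME, CITY, SNO, PNO).
crossed = {s + se for s in SUPPLIER for se in SELLS}   # SUPPLIER x SELLS
matched = {t for t in crossed if t[0] == t[3]}         # equal SNO values
part_two = {t for t in matched if t[4] == 2}           # PNO = 2
algebra_result = {t[1] for t in part_two}              # keep only SNAME

# Calculus style: describe which tuples are wanted, not the steps to get them.
calculus_result = {
    sname
    for (sno, sname, _city) in SUPPLIER
    if any(se_sno == sno and pno == 2 for (se_sno, pno) in SELLS)
}

print(sorted(algebra_result))
print(sorted(calculus_result))
```

Both formulations describe the same set of names; only the first one spells out an order of operations.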