Copyright © 1996–2024 The PostgreSQL Global Development Group
Legal Notice
PostgreSQL is Copyright © 1996–2024 by the PostgreSQL Global Development Group.
Postgres95 is Copyright © 1994–5 by the Regents of the University of California.
Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies.
IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN “AS-IS” BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
Table of Contents
This book is the official documentation of PostgreSQL. It has been written by the PostgreSQL developers and other volunteers in parallel to the development of the PostgreSQL software. It describes all the functionality that the current version of PostgreSQL officially supports.
To make the large amount of information about PostgreSQL manageable, this book has been organized in several parts. Each part is targeted at a different class of users, or at users in different stages of their PostgreSQL experience:
Part I is an informal introduction for new users.
Part II documents the SQL query language environment, including data types and functions, as well as user-level performance tuning. Every PostgreSQL user should read this.
Part III describes the installation and administration of the server. Everyone who runs a PostgreSQL server, be it for private use or for others, should read this part.
Part IV describes the programming interfaces for PostgreSQL client programs.
Part V contains information for advanced users about the extensibility capabilities of the server. Topics include user-defined data types and functions.
Part VI contains reference information about SQL commands, client and server programs. This part supports the other parts with structured information sorted by command or program.
Part VII contains assorted information that might be of use to PostgreSQL developers.
PostgreSQL is an object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department. POSTGRES pioneered many concepts that only became available in some commercial database systems much later.
PostgreSQL is an open-source descendant of this original Berkeley code. It supports a large part of the SQL standard and offers many modern features, including complex queries, foreign keys, triggers, updatable views, transactional integrity, and multiversion concurrency control.
Also, PostgreSQL can be extended by the user in many ways, for example by adding new data types, functions, operators, aggregate functions, index methods, and procedural languages.
And because of the liberal license, PostgreSQL can be used, modified, and distributed by anyone free of charge for any purpose, be it private, commercial, or academic.
The object-relational database management system now known as PostgreSQL is derived from the POSTGRES package written at the University of California at Berkeley. With decades of development behind it, PostgreSQL is now the most advanced open-source database available anywhere.
The POSTGRES project, led by Professor Michael Stonebraker, was sponsored by the Defense Advanced Research Projects Agency (DARPA), the Army Research Office (ARO), the National Science Foundation (NSF), and ESL, Inc. The implementation of POSTGRES began in 1986. The initial concepts for the system were presented in [ston86], and the definition of the initial data model appeared in [rowe87]. The design of the rule system at that time was described in [ston87a]. The rationale and architecture of the storage manager were detailed in [ston87b].
POSTGRES has undergone several major releases since then. The first “demoware” system became operational in 1987 and was shown at the 1988 ACM-SIGMOD Conference. Version 1, described in [ston90a], was released to a few external users in June 1989. In response to a critique of the first rule system ([ston89]), the rule system was redesigned ([ston90b]), and Version 2 was released in June 1990 with the new rule system. Version 3 appeared in 1991 and added support for multiple storage managers, an improved query executor, and a rewritten rule system. For the most part, subsequent releases until Postgres95 (see below) focused on portability and reliability.
POSTGRES has been used to implement many different research and production applications. These include: a financial data analysis system, a jet engine performance monitoring package, an asteroid tracking database, a medical information database, and several geographic information systems. POSTGRES has also been used as an educational tool at several universities. Finally, Illustra Information Technologies (later merged into Informix, which is now owned by IBM) picked up the code and commercialized it. In late 1992, POSTGRES became the primary data manager for the Sequoia 2000 scientific computing project.
The size of the external user community nearly doubled during 1993. It became increasingly obvious that maintenance of the prototype code and support was taking up large amounts of time that should have been devoted to database research. In an effort to reduce this support burden, the Berkeley POSTGRES project officially ended with Version 4.2.
In 1994, Andrew Yu and Jolly Chen added an SQL language interpreter to POSTGRES. Under a new name, Postgres95 was subsequently released to the web to find its own way in the world as an open-source descendant of the original POSTGRES Berkeley code.
Postgres95 code was completely ANSI C and trimmed in size by 25%. Many internal changes improved performance and maintainability. Postgres95 release 1.0.x ran about 30–50% faster on the Wisconsin Benchmark compared to POSTGRES, Version 4.2. Apart from bug fixes, the following were the major enhancements:
The query language PostQUEL was replaced with SQL (implemented in the server). (The interface library libpq was named after PostQUEL.) Subqueries were not supported until PostgreSQL (see below), but they could be imitated in Postgres95 with user-defined SQL functions. Aggregate functions were re-implemented. Support for the GROUP BY query clause was also added.
A new program (psql) was provided for interactive SQL queries, which used GNU Readline. This largely superseded the old monitor program.
A new front-end library, libpgtcl, supported Tcl-based clients. A sample shell, pgtclsh, provided new Tcl commands to interface Tcl programs with the Postgres95 server.
The large-object interface was overhauled. The inversion large objects were the only mechanism for storing large objects. (The inversion file system was removed.)
The instance-level rule system was removed. Rules were still available as rewrite rules.
A short tutorial introducing regular SQL features as well as those of Postgres95 was distributed with the source code.
GNU make (instead of BSD make) was used for the build. Also, Postgres95 could be compiled with an unpatched GCC (data alignment of doubles was fixed).
By 1996, it became clear that the name “Postgres95” would not stand the test of time. We chose a new name, PostgreSQL, to reflect the relationship between the original POSTGRES and the more recent versions with SQL capability. At the same time, we set the version numbering to start at 6.0, putting the numbers back into the sequence originally begun by the Berkeley POSTGRES project.
Many people continue to refer to PostgreSQL as “Postgres” (now rarely in all capital letters) because of tradition or because it is easier to pronounce. This usage is widely accepted as a nickname or alias.
The emphasis during development of Postgres95 was on identifying and understanding existing problems in the server code. With PostgreSQL, the emphasis has shifted to augmenting features and capabilities, although work continues in all areas.
Details about what has happened in PostgreSQL since then can be found in Appendix E.
The following conventions are used in the synopsis of a command:
Brackets ([ and ]) indicate optional parts. Braces ({ and }) and vertical lines (|) indicate that you must choose one alternative. Dots (...) mean that the preceding element can be repeated. All other symbols, including parentheses, should be taken literally.
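For example, the synopsis of the DROP TABLE command uses these conventions:
DROP TABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]
Here IF EXISTS is optional, name can be repeated, and at most one of CASCADE or RESTRICT may be chosen.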
Where it enhances the clarity, SQL commands are preceded by the prompt =>, and shell commands are preceded by the prompt $. Normally, prompts are not shown, though.
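For instance, if prompts were shown, a shell command and an SQL command might appear like this (mydb is simply an example database name):
$ createdb mydb
=> SELECT current_date;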
An administrator is generally a person who is in charge of installing and running the server. A user could be anyone who is using, or wants to use, any part of the PostgreSQL system. These terms should not be interpreted too narrowly; this book does not have fixed presumptions about system administration procedures.
Besides the documentation, that is, this book, there are other resources about PostgreSQL:
The PostgreSQL wiki contains the project's FAQ (Frequently Asked Questions) list, TODO list, and detailed information about many more topics.
The PostgreSQL web site carries details on the latest release and other information to make your work or play with PostgreSQL more productive.
The mailing lists are a good place to have your questions answered, to share experiences with other users, and to contact the developers. Consult the PostgreSQL web site for details.
PostgreSQL is an open-source project. As such, it depends on the user community for ongoing support. As you begin to use PostgreSQL, you will rely on others for help, either through the documentation or through the mailing lists. Consider contributing your knowledge back. Read the mailing lists and answer questions. If you learn something which is not in the documentation, write it up and contribute it. If you add features to the code, contribute them.
When you find a bug in PostgreSQL we want to hear about it. Your bug reports play an important part in making PostgreSQL more reliable because even the utmost care cannot guarantee that every part of PostgreSQL will work on every platform under every circumstance.
The following suggestions are intended to assist you in forming bug reports that can be handled in an effective fashion. No one is required to follow them but doing so tends to be to everyone's advantage.
We cannot promise to fix every bug right away. If the bug is obvious, critical, or affects a lot of users, chances are good that someone will look into it. It could also happen that we tell you to update to a newer version to see if the bug happens there. Or we might decide that the bug cannot be fixed before some major rewrite we might be planning is done. Or perhaps it is simply too hard and there are more important things on the agenda. If you need help immediately, consider obtaining a commercial support contract.
Before you report a bug, please read and re-read the documentation to verify that you can really do whatever it is you are trying. If it is not clear from the documentation whether you can do something or not, please report that too; it is a bug in the documentation. If it turns out that a program does something different from what the documentation says, that is a bug. That might include, but is not limited to, the following circumstances:
A program terminates with a fatal signal or an operating system error message that would point to a problem in the program. (A counterexample might be a “disk full” message, since you have to fix that yourself.)
A program produces the wrong output for any given input.
A program refuses to accept valid input (as defined in the documentation).
A program accepts invalid input without a notice or error message. But keep in mind that your idea of invalid input might be our idea of an extension or compatibility with traditional practice.
PostgreSQL fails to compile, build, or install according to the instructions on supported platforms.
Here “program” refers to any executable, not only the backend process.
Being slow or resource-hogging is not necessarily a bug. Read the documentation or ask on one of the mailing lists for help in tuning your applications. Failing to comply with the SQL standard is not necessarily a bug either, unless compliance for the specific feature is explicitly claimed.
Before you continue, check on the TODO list and in the FAQ to see if your bug is already known. If you cannot decode the information on the TODO list, report your problem. The least we can do is make the TODO list clearer.
The most important thing to remember about bug reporting is to state all the facts and only facts. Do not speculate what you think went wrong, what “it seemed to do”, or which part of the program has a fault. If you are not familiar with the implementation you would probably guess wrong and not help us a bit. And even if you are, educated explanations are a great supplement to but no substitute for facts. If we are going to fix the bug we still have to see it happen for ourselves first. Reporting the bare facts is relatively straightforward (you can probably copy and paste them from the screen) but all too often important details are left out because someone thought they did not matter or that the report would be understood anyway.
The following items should be contained in every bug report:
The exact sequence of steps from program start-up necessary to reproduce the problem. This should be self-contained; it is not enough to send in a bare SELECT statement without the preceding CREATE TABLE and INSERT statements, if the output should depend on the data in the tables. We do not have the time to reverse-engineer your database schema, and if we are supposed to make up our own data we would probably miss the problem.
The best format for a test case for SQL-related problems is a file that can be run through the psql frontend that shows the problem. (Be sure to not have anything in your ~/.psqlrc start-up file.) An easy way to create this file is to use pg_dump to dump out the table declarations and data needed to set the scene, then add the problem query. You are encouraged to minimize the size of your example, but this is not absolutely necessary. If the bug is reproducible, we will find it either way.
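As a sketch only (the table, data, and query here are invented for illustration), a self-contained test case file could be as small as:
CREATE TABLE t (id int, val text);
INSERT INTO t VALUES (1, 'a'), (2, 'b');
SELECT val FROM t WHERE id = 2;   -- the statement that misbehaves goes here
To pull the real declarations and data out of an existing database, a pg_dump invocation along these lines can be used (database and table names are placeholders):
$ pg_dump --table=t --column-inserts mydb > testcase.sql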
If your application uses some other client interface, such as PHP, then please try to isolate the offending queries. We will probably not set up a web server to reproduce your problem. In any case remember to provide the exact input files; do not guess that the problem happens for “large files” or “midsize databases”, etc. since this information is too inexact to be of use.
The output you got. Please do not say that it “didn't work” or “crashed”. If there is an error message, show it, even if you do not understand it. If the program terminates with an operating system error, say which. If nothing at all happens, say so. Even if the result of your test case is a program crash or otherwise obvious it might not happen on our platform. The easiest thing is to copy the output from the terminal, if possible.
If you are reporting an error message, please obtain the most verbose form of the message. In psql, say \set VERBOSITY verbose beforehand. If you are extracting the message from the server log, set the run-time parameter log_error_verbosity to verbose so that all details are logged.
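For example, in a psql session:
mydb=> \set VERBOSITY verbose
or, for the server log, in postgresql.conf (the file location depends on your installation):
log_error_verbosity = verbose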
In case of fatal errors, the error message reported by the client might not contain all the information available. Please also look at the log output of the database server. If you do not keep your server's log output, this would be a good time to start doing so.
The output you expected is very important to state. If you just write “This command gives me that output.” or “This is not what I expected.”, we might run it ourselves, scan the output, and think it looks OK and is exactly what we expected. We should not have to spend the time to decode the exact semantics behind your commands. Especially refrain from merely saying that “This is not what SQL says/Oracle does.” Digging out the correct behavior from SQL is not a fun undertaking, nor do we all know how all the other relational databases out there behave. (If your problem is a program crash, you can obviously omit this item.)
Any command line options and other start-up options, including any relevant environment variables or configuration files that you changed from the default. Again, please provide exact information. If you are using a prepackaged distribution that starts the database server at boot time, you should try to find out how that is done.
Anything you did at all differently from the installation instructions.
The PostgreSQL version. You can run the command SELECT version(); to find out the version of the server you are connected to. Most executable programs also support a --version option; at least postgres --version and psql --version should work. If the function or the options do not exist then your version is more than old enough to warrant an upgrade.
If you run a prepackaged version, such as RPMs, say so, including any subversion the package might have. If you are talking about a Git snapshot, mention that, including the commit hash.
If your version is older than 14.13 we will almost certainly tell you to upgrade. There are many bug fixes and improvements in each new release, so it is quite possible that a bug you have encountered in an older release of PostgreSQL has already been fixed. We can only provide limited support for sites using older releases of PostgreSQL; if you require more than we can provide, consider acquiring a commercial support contract.
Platform information. This includes the kernel name and version, C library, processor, memory information, and so on. In most cases it is sufficient to report the vendor and version, but do not assume everyone knows what exactly “Debian” contains or that everyone runs on x86_64. If you have installation problems then information about the toolchain on your machine (compiler, make, and so on) is also necessary.
Do not be afraid if your bug report becomes rather lengthy. That is a fact of life. It is better to report everything the first time than us having to squeeze the facts out of you. On the other hand, if your input files are huge, it is fair to ask first whether somebody is interested in looking into it. Here is an article that outlines some more tips on reporting bugs.
Do not spend all your time figuring out which changes in the input make the problem go away. This will probably not help solve it. If it turns out that the bug cannot be fixed right away, you will still have time to find and share your work-around. Also, once again, do not waste your time guessing why the bug exists. We will find that out soon enough.
When writing a bug report, please avoid confusing terminology. The software package in total is called “PostgreSQL”, sometimes “Postgres” for short. If you are specifically talking about the backend process, mention that; do not just say “PostgreSQL crashes”. A crash of a single backend process is quite different from a crash of the parent “postgres” process; please don't say “the server crashed” when you mean a single backend process went down, nor vice versa. Also, client programs such as the interactive frontend “psql” are completely separate from the backend. Please try to be specific about whether the problem is on the client or server side.
In general, send bug reports to the bug report mailing list at <pgsql-bugs@lists.postgresql.org>. You are requested to use a descriptive subject for your email message, perhaps parts of the error message.
Another method is to fill in the bug report web-form available at the project's web site. Entering a bug report this way causes it to be mailed to the <pgsql-bugs@lists.postgresql.org> mailing list.
If your bug report has security implications and you'd prefer that it not become immediately visible in public archives, don't send it to pgsql-bugs. Security issues can be reported privately to <security@postgresql.org>.
Do not send bug reports to any of the user mailing lists, such as <pgsql-sql@lists.postgresql.org> or <pgsql-general@lists.postgresql.org>. These mailing lists are for answering user questions, and their subscribers normally do not wish to receive bug reports. More importantly, they are unlikely to fix them.
Also, please do not send reports to the developers' mailing list <pgsql-hackers@lists.postgresql.org>. This list is for discussing the development of PostgreSQL, and it would be nice if we could keep the bug reports separate. We might choose to take up a discussion about your bug report on pgsql-hackers, if the problem needs more review.
If you have a problem with the documentation, the best place to report it is the documentation mailing list <pgsql-docs@lists.postgresql.org>. Please be specific about what part of the documentation you are unhappy with.
If your bug is a portability problem on a non-supported platform, send mail to <pgsql-hackers@lists.postgresql.org>, so we (and you) can work on porting PostgreSQL to your platform.
Due to the unfortunate amount of spam going around, all of the above lists will be moderated unless you are subscribed. That means there will be some delay before the email is delivered. If you wish to subscribe to the lists, please visit https://lists.postgresql.org/ for instructions.
Welcome to the PostgreSQL Tutorial. The following few chapters are intended to give a simple introduction to PostgreSQL, relational database concepts, and the SQL language to those who are new to any one of these aspects. We only assume some general knowledge about how to use computers. No particular Unix or programming experience is required. This part is mainly intended to give you some hands-on experience with important aspects of the PostgreSQL system. It makes no attempt to be a complete or thorough treatment of the topics it covers.
After you have worked through this tutorial you might want to move on to reading Part II to gain a more formal knowledge of the SQL language, or Part IV for information about developing applications for PostgreSQL. Those who set up and manage their own server should also read Part III.
Table of Contents
Before you can use PostgreSQL you need to install it, of course. It is possible that PostgreSQL is already installed at your site, either because it was included in your operating system distribution or because the system administrator already installed it. If that is the case, you should obtain information from the operating system documentation or your system administrator about how to access PostgreSQL.
If you are not sure whether PostgreSQL is already available or whether you can use it for your experimentation then you can install it yourself. Doing so is not hard and it can be a good exercise. PostgreSQL can be installed by any unprivileged user; no superuser (root) access is required.
If you are installing PostgreSQL yourself, then refer to Chapter 17 for instructions on installation, and return to this guide when the installation is complete. Be sure to follow closely the section about setting up the appropriate environment variables.
If your site administrator has not set things up in the default way, you might have some more work to do. For example, if the database server machine is a remote machine, you will need to set the PGHOST environment variable to the name of the database server machine. The environment variable PGPORT might also have to be set. The bottom line is this: if you try to start an application program and it complains that it cannot connect to the database, you should consult your site administrator or, if that is you, the documentation to make sure that your environment is properly set up. If you did not understand the preceding paragraph then read the next section.
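For example, in a Bourne-style shell the setup for a remote server might look like this (the host name and port are purely illustrative):
$ export PGHOST=db.example.com
$ export PGPORT=5433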
Before we proceed, you should understand the basic PostgreSQL system architecture. Understanding how the parts of PostgreSQL interact will make this chapter somewhat clearer.
In database jargon, PostgreSQL uses a client/server model. A PostgreSQL session consists of the following cooperating processes (programs):
A server process, which manages the database files, accepts connections to the database from client applications, and performs database actions on behalf of the clients. The database server program is called postgres.
The user's client (frontend) application that wants to perform database operations. Client applications can be very diverse in nature: a client could be a text-oriented tool, a graphical application, a web server that accesses the database to display web pages, or a specialized database maintenance tool. Some client applications are supplied with the PostgreSQL distribution; most are developed by users.
As is typical of client/server applications, the client and the server can be on different hosts. In that case they communicate over a TCP/IP network connection. You should keep this in mind, because the files that can be accessed on a client machine might not be accessible (or might only be accessible using a different file name) on the database server machine.
The PostgreSQL server can handle multiple concurrent connections from clients. To achieve this it starts (“forks”) a new process for each connection. From that point on, the client and the new server process communicate without intervention by the original postgres process. Thus, the supervisor server process is always running, waiting for client connections, whereas client and associated server processes come and go. (All of this is of course invisible to the user. We only mention it here for completeness.)
The first test to see whether you can access the database server is to try to create a database. A running PostgreSQL server can manage many databases. Typically, a separate database is used for each project or for each user.
Possibly, your site administrator has already created a database for your use. In that case you can omit this step and skip ahead to the next section.
To create a new database, in this example named mydb, you use the following command:
$ createdb mydb
If this produces no response then this step was successful and you can skip over the remainder of this section.
If you see a message similar to:
createdb: command not found
then PostgreSQL was not installed properly. Either it was not installed at all or your shell's search path was not set to include it. Try calling the command with an absolute path instead:
$ /usr/local/pgsql/bin/createdb mydb
The path at your site might be different. Contact your site administrator or check the installation instructions to correct the situation.
Another response could be this:
createdb: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?
This means that the server was not started, or it is not listening where createdb expects to contact it. Again, check the installation instructions or consult the administrator.
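If you have shell access to the server machine, one way to check whether the server is running is pg_ctl's status mode (the data directory path is only an example and depends on your installation):
$ pg_ctl -D /usr/local/pgsql/data status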
Another response could be this:
createdb: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: role "joe" does not exist
where your own login name is mentioned. This will happen if the administrator has not created a PostgreSQL user account for you. (PostgreSQL user accounts are distinct from operating system user accounts.) If you are the administrator, see Chapter 22 for help creating accounts. You will need to become the operating system user under which PostgreSQL was installed (usually postgres) to create the first user account. It could also be that you were assigned a PostgreSQL user name that is different from your operating system user name; in that case you need to use the -U switch or set the PGUSER environment variable to specify your PostgreSQL user name.
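For example, if your assigned PostgreSQL user name were joe (a placeholder), either of the following would work:
$ createdb -U joe mydb
$ PGUSER=joe createdb mydb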
If you have a user account but it does not have the privileges required to create a database, you will see the following:
createdb: error: database creation failed: ERROR: permission denied to create database
Not every user has authorization to create new databases. If PostgreSQL refuses to create databases for you then the site administrator needs to grant you permission to create databases. Consult your site administrator if this occurs. If you installed PostgreSQL yourself then you should log in for the purposes of this tutorial under the user account that you started the server as. [1]
You can also create databases with other names. PostgreSQL allows you to create any number of databases at a given site. Database names must have an alphabetic first character and are limited to 63 bytes in length. A convenient choice is to create a database with the same name as your current user name. Many tools assume that database name as the default, so it can save you some typing. To create that database, simply type:
$ createdb
If you do not want to use your database anymore you can remove it. For example, if you are the owner (creator) of the database mydb, you can destroy it using the following command:
$ dropdb mydb
(For this command, the database name does not default to the user account name. You always need to specify it.) This action physically removes all files associated with the database and cannot be undone, so this should only be done with a great deal of forethought.
More about createdb and dropdb can be found in createdb and dropdb respectively.
Once you have created a database, you can access it by:
Running the PostgreSQL interactive terminal program, called psql, which allows you to interactively enter, edit, and execute SQL commands.
Using an existing graphical frontend tool like pgAdmin or an office suite with ODBC or JDBC support to create and manipulate a database. These possibilities are not covered in this tutorial.
Writing a custom application, using one of the several available language bindings. These possibilities are discussed further in Part IV.
You probably want to start up psql to try the examples in this tutorial. It can be activated for the mydb database by typing the command:
$ psql mydb
If you do not supply the database name then it will default to your user account name. You already discovered this scheme in the previous section using createdb.
In psql, you will be greeted with the following message:
psql (14.13)
Type "help" for help.

mydb=>
The last line could also be:
mydb=#
That would mean you are a database superuser, which is most likely the case if you installed the PostgreSQL instance yourself. Being a superuser means that you are not subject to access controls. For the purposes of this tutorial that is not important.
If you encounter problems starting psql then go back to the previous section. The diagnostics of createdb and psql are similar, and if the former worked the latter should work as well.
The last line printed out by psql is the prompt, and it indicates that psql is listening to you and that you can type SQL queries into a work space maintained by psql. Try out these commands:
mydb=> SELECT version();
                                         version
------------------------------------------------------------------------------------------
 PostgreSQL 14.13 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)

mydb=> SELECT current_date;
    date
------------
 2016-01-07
(1 row)

mydb=> SELECT 2 + 2;
 ?column?
----------
        4
(1 row)
The psql program has a number of internal commands that are not SQL commands. They begin with the backslash character, “\”. For example, you can get help on the syntax of various PostgreSQL SQL commands by typing:
mydb=> \h
To get out of psql, type:
mydb=> \q
and psql will quit and return you to your command shell. (For more internal commands, type \? at the psql prompt.) The full capabilities of psql are documented in psql. In this tutorial we will not use these features explicitly, but you can use them yourself when it is helpful.
[1] As an explanation for why this works: PostgreSQL user names are separate from operating system user accounts. When you connect to a database, you can choose what PostgreSQL user name to connect as; if you don't, it will default to the same name as your current operating system account. As it happens, there will always be a PostgreSQL user account that has the same name as the operating system user that started the server, and it also happens that that user always has permission to create databases. Instead of logging in as that user you can also specify the -U option everywhere to select a PostgreSQL user name to connect as.
Table of Contents
This chapter provides an overview of how to use SQL to perform simple operations. This tutorial is only intended to give you an introduction and is in no way a complete tutorial on SQL. Numerous books have been written on SQL, including [melt93] and [date97]. You should be aware that some PostgreSQL language features are extensions to the standard.
In the examples that follow, we assume that you have created a database named mydb, as described in the previous chapter, and have been able to start psql.
Examples in this manual can also be found in the PostgreSQL source distribution in the directory src/tutorial/. (Binary distributions of PostgreSQL might not provide those files.) To use those files, first change to that directory and run make:
$ cd .../src/tutorial
$ make
This creates the scripts and compiles the C files containing user-defined functions and types. Then, to start the tutorial, do the following:
$ psql -s mydb
...
mydb=> \i basics.sql
The \i command reads in commands from the specified file. psql's -s option puts you in single step mode which pauses before sending each statement to the server. The commands used in this section are in the file basics.sql.
PostgreSQL is a relational database management system (RDBMS). That means it is a system for managing data stored in relations. Relation is essentially a mathematical term for table. The notion of storing data in tables is so commonplace today that it might seem inherently obvious, but there are a number of other ways of organizing databases. Files and directories on Unix-like operating systems form an example of a hierarchical database. A more modern development is the object-oriented database.
Each table is a named collection of rows. Each row of a given table has the same set of named columns, and each column is of a specific data type. Whereas columns have a fixed order in each row, it is important to remember that SQL does not guarantee the order of the rows within the table in any way (although they can be explicitly sorted for display).
Tables are grouped into databases, and a collection of databases managed by a single PostgreSQL server instance constitutes a database cluster.
You can create a new table by specifying the table name, along with all column names and their types:
CREATE TABLE weather (
    city      varchar(80),
    temp_lo   int,           -- low temperature
    temp_hi   int,           -- high temperature
    prcp      real,          -- precipitation
    date      date
);
You can enter this into psql with the line breaks. psql will recognize that the command is not terminated until the semicolon.
White space (i.e., spaces, tabs, and newlines) can be used freely in SQL commands. That means you can type the command aligned differently than above, or even all on one line. Two dashes (“--”) introduce comments. Whatever follows them is ignored up to the end of the line. SQL is case insensitive about key words and identifiers, except when identifiers are double-quoted to preserve the case (not done above).
varchar(80) specifies a data type that can store arbitrary character strings up to 80 characters in length. int is the normal integer type. real is a type for storing single precision floating-point numbers. date should be self-explanatory. (Yes, the column of type date is also named date. This might be convenient or confusing — you choose.)
PostgreSQL supports the standard SQL types int, smallint, real, double precision, char(N), varchar(N), date, time, timestamp, and interval, as well as other types of general utility and a rich set of geometric types. PostgreSQL can be customized with an arbitrary number of user-defined data types. Consequently, type names are not key words in the syntax, except where required to support special cases in the SQL standard.
The second example will store cities and their associated geographical location:
CREATE TABLE cities ( name varchar(80), location point );
The point type is an example of a PostgreSQL-specific data type.
Finally, it should be mentioned that if you don't need a table any longer or want to recreate it differently you can remove it using the following command:
DROP TABLE tablename;
The INSERT statement is used to populate a table with rows:
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
Note that all data types use rather obvious input formats. Constants that are not simple numeric values usually must be surrounded by single quotes ('), as in the example. The date type is actually quite flexible in what it accepts, but for this tutorial we will stick to the unambiguous format shown here.
The point type requires a coordinate pair as input, as shown here:
INSERT INTO cities VALUES ('San Francisco', '(-194.0, 53.0)');
The syntax used so far requires you to remember the order of the columns. An alternative syntax allows you to list the columns explicitly:
INSERT INTO weather (city, temp_lo, temp_hi, prcp, date) VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');
You can list the columns in a different order if you wish or even omit some columns, e.g., if the precipitation is unknown:
INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward', 54, 37);
Many developers consider explicitly listing the columns better style than relying on the order implicitly.
Please enter all the commands shown above so you have some data to work with in the following sections.
You could also have used COPY to load large amounts of data from flat-text files. This is usually faster because the COPY command is optimized for this application while allowing less flexibility than INSERT. An example would be:
COPY weather FROM '/home/user/weather.txt';
where the file name for the source file must be available on the machine running the backend process, not the client, since the backend process reads the file directly. You can read more about the COPY command in COPY.
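If the data file lives on the client machine rather than the server, psql's \copy meta-command offers a rough client-side equivalent; it reads the file locally and sends the rows to the server (the file name is illustrative):
\copy weather FROM 'weather.txt'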
To retrieve data from a table, the table is queried. An SQL SELECT statement is used to do this. The statement is divided into a select list (the part that lists the columns to be returned), a table list (the part that lists the tables from which to retrieve the data), and an optional qualification (the part that specifies any restrictions). For example, to retrieve all the rows of table weather, type:
SELECT * FROM weather;
Here * is a shorthand for “all columns”.[2] So the same result would be had with:
SELECT city, temp_lo, temp_hi, prcp, date FROM weather;
The output should be:
     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
 San Francisco |      43 |      57 |    0 | 1994-11-29
 Hayward       |      37 |      54 |      | 1994-11-29
(3 rows)
You can write expressions, not just simple column references, in the select list. For example, you can do:
SELECT city, (temp_hi+temp_lo)/2 AS temp_avg, date FROM weather;
This should give:
     city      | temp_avg |    date
---------------+----------+------------
 San Francisco |       48 | 1994-11-27
 San Francisco |       50 | 1994-11-29
 Hayward       |       45 | 1994-11-29
(3 rows)
Notice how the AS clause is used to relabel the output column. (The AS clause is optional.)
A query can be “qualified” by adding a WHERE clause that specifies which rows are wanted. The WHERE clause contains a Boolean (truth value) expression, and only rows for which the Boolean expression is true are returned. The usual Boolean operators (AND, OR, and NOT) are allowed in the qualification. For example, the following retrieves the weather of San Francisco on rainy days:
SELECT * FROM weather WHERE city = 'San Francisco' AND prcp > 0.0;
Result:
     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
(1 row)
You can request that the results of a query be returned in sorted order:
SELECT * FROM weather ORDER BY city;
     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 Hayward       |      37 |      54 |      | 1994-11-29
 San Francisco |      43 |      57 |    0 | 1994-11-29
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
In this example, the sort order isn't fully specified, and so you might get the San Francisco rows in either order. But you'd always get the results shown above if you do:
SELECT * FROM weather ORDER BY city, temp_lo;
You can request that duplicate rows be removed from the result of a query:
SELECT DISTINCT city FROM weather;
     city
---------------
 Hayward
 San Francisco
(2 rows)
Here again, the result row ordering might vary.
You can ensure consistent results by using DISTINCT and ORDER BY together:[3]
SELECT DISTINCT city FROM weather ORDER BY city;
Thus far, our queries have only accessed one table at a time. Queries can access multiple tables at once, or access the same table in such a way that multiple rows of the table are being processed at the same time. Queries that access multiple tables (or multiple instances of the same table) at one time are called join queries. They combine rows from one table with rows from a second table, with an expression specifying which rows are to be paired. For example, to return all the weather records together with the location of the associated city, the database needs to compare the city column of each row of the weather table with the name column of all rows in the cities table, and select the pairs of rows where these values match.[4] This would be accomplished by the following query:
SELECT * FROM weather JOIN cities ON city = name;
     city      | temp_lo | temp_hi | prcp |    date    |     name      | location
---------------+---------+---------+------+------------+---------------+-----------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(2 rows)
Observe two things about the result set:
There is no result row for the city of Hayward. This is because there is no matching entry in the cities table for Hayward, so the join ignores the unmatched rows in the weather table. We will see shortly how this can be fixed.
There are two columns containing the city name. This is correct because the lists of columns from the weather and cities tables are concatenated. In practice this is undesirable, though, so you will probably want to list the output columns explicitly rather than using *:
SELECT city, temp_lo, temp_hi, prcp, date, location FROM weather JOIN cities ON city = name;
Since the columns all had different names, the parser automatically found which table they belong to. If there were duplicate column names in the two tables you'd need to qualify the column names to show which one you meant, as in:
SELECT weather.city, weather.temp_lo, weather.temp_hi,
       weather.prcp, weather.date, cities.location
    FROM weather JOIN cities ON weather.city = cities.name;
It is widely considered good style to qualify all column names in a join query, so that the query won't fail if a duplicate column name is later added to one of the tables.
Join queries of the kind seen thus far can also be written in this form:
SELECT * FROM weather, cities WHERE city = name;
This syntax pre-dates the JOIN/ON syntax, which was introduced in SQL-92. The tables are simply listed in the FROM clause, and the comparison expression is added to the WHERE clause. The results from this older implicit syntax and the newer explicit JOIN/ON syntax are identical. But for a reader of the query, the explicit syntax makes its meaning easier to understand: The join condition is introduced by its own key word whereas previously the condition was mixed into the WHERE clause together with other conditions.
Now we will figure out how we can get the Hayward records back in. What we want the query to do is to scan the weather table and for each row to find the matching cities row(s). If no matching row is found we want some “empty values” to be substituted for the cities table's columns. This kind of query is called an outer join. (The joins we have seen so far are inner joins.) The command looks like this:
SELECT * FROM weather LEFT OUTER JOIN cities ON weather.city = cities.name;
     city      | temp_lo | temp_hi | prcp |    date    |     name      | location
---------------+---------+---------+------+------------+---------------+-----------
 Hayward       |      37 |      54 |      | 1994-11-29 |               |
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(3 rows)
This query is called a left outer join because the table mentioned on the left of the join operator will have each of its rows in the output at least once, whereas the table on the right will only have those rows output that match some row of the left table. When outputting a left-table row for which there is no right-table match, empty (null) values are substituted for the right-table columns.
Exercise: There are also right outer joins and full outer joins. Try to find out what those do.
We can also join a table against itself. This is called a self join. As an example, suppose we wish to find all the weather records that are in the temperature range of other weather records. So we need to compare the temp_lo and temp_hi columns of each weather row to the temp_lo and temp_hi columns of all other weather rows. We can do this with the following query:
SELECT w1.city, w1.temp_lo AS low, w1.temp_hi AS high,
       w2.city, w2.temp_lo AS low, w2.temp_hi AS high
    FROM weather w1 JOIN weather w2
        ON w1.temp_lo < w2.temp_lo AND w1.temp_hi > w2.temp_hi;
     city      | low | high |     city      | low | high
---------------+-----+------+---------------+-----+------
 San Francisco |  43 |   57 | San Francisco |  46 |   50
 Hayward       |  37 |   54 | San Francisco |  46 |   50
(2 rows)
Here we have relabeled the weather table as w1 and w2 to be able to distinguish the left and right side of the join. You can also use these kinds of aliases in other queries to save some typing, e.g.:
SELECT * FROM weather w JOIN cities c ON w.city = c.name;
You will encounter this style of abbreviating quite frequently.
Like most other relational database products, PostgreSQL supports aggregate functions. An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the count, sum, avg (average), max (maximum) and min (minimum) over a set of rows.
As an example, we can find the highest low-temperature reading anywhere with:
SELECT max(temp_lo) FROM weather;
 max
-----
  46
(1 row)
If we wanted to know what city (or cities) that reading occurred in, we might try:
SELECT city FROM weather WHERE temp_lo = max(temp_lo); WRONG
but this will not work since the aggregate max cannot be used in the WHERE clause. (This restriction exists because the WHERE clause determines which rows will be included in the aggregate calculation; so obviously it has to be evaluated before aggregate functions are computed.)
However, as is often the case the query can be restated to accomplish the desired result, here by using a subquery:
SELECT city FROM weather WHERE temp_lo = (SELECT max(temp_lo) FROM weather);
     city
---------------
 San Francisco
(1 row)
This is OK because the subquery is an independent computation that computes its own aggregate separately from what is happening in the outer query.
Aggregates are also very useful in combination with GROUP BY clauses. For example, we can get the number of readings and the maximum low temperature observed in each city with:
SELECT city, count(*), max(temp_lo) FROM weather GROUP BY city;
     city      | count | max
---------------+-------+-----
 Hayward       |     1 |  37
 San Francisco |     2 |  46
(2 rows)
which gives us one output row per city. Each aggregate result is computed over the table rows matching that city. We can filter these grouped rows using HAVING:
SELECT city, count(*), max(temp_lo) FROM weather GROUP BY city HAVING max(temp_lo) < 40;
  city   | count | max
---------+-------+-----
 Hayward |     1 |  37
(1 row)
which gives us the same results for only the cities that have all temp_lo values below 40. Finally, if we only care about cities whose names begin with “S”, we might do:
SELECT city, count(*), max(temp_lo)
    FROM weather
    WHERE city LIKE 'S%'            -- (1)
    GROUP BY city;
     city      | count | max
---------------+-------+-----
 San Francisco |     2 |  46
(1 row)
The LIKE operator in line (1) does pattern matching and is explained in Section 9.7.
It is important to understand the interaction between aggregates and SQL's WHERE and HAVING clauses. The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine which rows will be inputs to the aggregates. On the other hand, the HAVING clause always contains aggregate functions. (Strictly speaking, you are allowed to write a HAVING clause that doesn't use aggregates, but it's seldom useful. The same condition could be used more efficiently at the WHERE stage.)
In the previous example, we can apply the city name restriction in WHERE, since it needs no aggregate. This is more efficient than adding the restriction to HAVING, because we avoid doing the grouping and aggregate calculations for all rows that fail the WHERE check.
Another way to select the rows that go into an aggregate computation is to use FILTER, which is a per-aggregate option:
SELECT city, count(*) FILTER (WHERE temp_lo < 45), max(temp_lo) FROM weather GROUP BY city;
     city      | count | max
---------------+-------+-----
 Hayward       |     1 |  37
 San Francisco |     1 |  46
(2 rows)
FILTER is much like WHERE, except that it removes rows only from the input of the particular aggregate function that it is attached to. Here, the count aggregate counts only rows with temp_lo below 45; but the max aggregate is still applied to all rows, so it still finds the reading of 46.
You can update existing rows using the UPDATE command. Suppose you discover the temperature readings are all off by 2 degrees after November 28. You can correct the data as follows:
UPDATE weather SET temp_hi = temp_hi - 2, temp_lo = temp_lo - 2 WHERE date > '1994-11-28';
Look at the new state of the data:
SELECT * FROM weather;

     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
 San Francisco |      41 |      55 |    0 | 1994-11-29
 Hayward       |      35 |      52 |      | 1994-11-29
(3 rows)
Rows can be removed from a table using the DELETE command. Suppose you are no longer interested in the weather of Hayward. Then you can do the following to delete those rows from the table:
DELETE FROM weather WHERE city = 'Hayward';
All weather records belonging to Hayward are removed.
SELECT * FROM weather;
     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
 San Francisco |      41 |      55 |    0 | 1994-11-29
(2 rows)
One should be wary of statements of the form
DELETE FROM tablename;
Without a qualification, DELETE will remove all rows from the given table, leaving it empty. The system will not request confirmation before doing this!
[2] While SELECT * is useful for off-the-cuff queries, it is widely considered bad style in production code, since adding a column to the table would change the results.
[3] In some database systems, including older versions of PostgreSQL, the implementation of DISTINCT automatically orders the rows and so ORDER BY is unnecessary. But this is not required by the SQL standard, and current PostgreSQL does not guarantee that DISTINCT causes the rows to be ordered.
[4] This is only a conceptual model. The join is usually performed in a more efficient manner than actually comparing each possible pair of rows, but this is invisible to the user.
Table of Contents
In the previous chapter we have covered the basics of using SQL to store and access your data in PostgreSQL. We will now discuss some more advanced features of SQL that simplify management and prevent loss or corruption of your data. Finally, we will look at some PostgreSQL extensions.
This chapter will on occasion refer to examples found in Chapter 2 to change or improve them, so it will be useful to have read that chapter. Some examples from this chapter can also be found in advanced.sql in the tutorial directory. This file also contains some sample data to load, which is not repeated here. (Refer to Section 2.1 for how to use the file.)
Refer back to the queries in Section 2.6. Suppose the combined listing of weather records and city location is of particular interest to your application, but you do not want to type the query each time you need it. You can create a view over the query, which gives a name to the query that you can refer to like an ordinary table:
CREATE VIEW myview AS
    SELECT name, temp_lo, temp_hi, prcp, date, location
        FROM weather, cities
        WHERE city = name;

SELECT * FROM myview;
Making liberal use of views is a key aspect of good SQL database design. Views allow you to encapsulate the details of the structure of your tables, which might change as your application evolves, behind consistent interfaces.
Views can be used in almost any place a real table can be used. Building views upon other views is not uncommon.
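As a small sketch (the view name and cutoff are invented for this example), a view can itself be defined on top of another view:
CREATE VIEW warm_places AS
    SELECT name, temp_hi FROM myview WHERE temp_hi > 50;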
Recall the weather and cities tables from Chapter 2. Consider the following problem: You want to make sure that no one can insert rows in the weather table that do not have a matching entry in the cities table. This is called maintaining the referential integrity of your data. In simplistic database systems this would be implemented (if at all) by first looking at the cities table to check if a matching record exists, and then inserting or rejecting the new weather records. This approach has a number of problems and is very inconvenient, so PostgreSQL can do this for you.
The new declaration of the tables would look like this:
CREATE TABLE cities (
    name     varchar(80) primary key,
    location point
);

CREATE TABLE weather (
    city     varchar(80) references cities(name),
    temp_lo  int,
    temp_hi  int,
    prcp     real,
    date     date
);
Now try inserting an invalid record:
INSERT INTO weather VALUES ('Berkeley', 45, 53, 0.0, '1994-11-28');
ERROR:  insert or update on table "weather" violates foreign key constraint "weather_city_fkey"
DETAIL:  Key (city)=(Berkeley) is not present in table "cities".
The behavior of foreign keys can be finely tuned to your application. We will not go beyond this simple example in this tutorial, but just refer you to Chapter 5 for more information. Making correct use of foreign keys will definitely improve the quality of your database applications, so you are strongly encouraged to learn about them.
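As one illustration of such tuning (this declaration is not part of the tutorial schema), a foreign key can specify what happens when the referenced row is deleted:
CREATE TABLE weather (
    city     varchar(80) references cities(name) ON DELETE CASCADE,
    temp_lo  int,
    temp_hi  int,
    prcp     real,
    date     date
);
With ON DELETE CASCADE, deleting a city would automatically delete its weather records as well; Chapter 5 covers the available options.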
Transactions are a fundamental concept of all database systems. The essential point of a transaction is that it bundles multiple steps into a single, all-or-nothing operation. The intermediate states between the steps are not visible to other concurrent transactions, and if some failure occurs that prevents the transaction from completing, then none of the steps affect the database at all.
For example, consider a bank database that contains balances for various customer accounts, as well as total deposit balances for branches. Suppose that we want to record a payment of $100.00 from Alice's account to Bob's account. Simplifying outrageously, the SQL commands for this might look like:
UPDATE accounts SET balance = balance - 100.00
    WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
    WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');
The details of these commands are not important here; the important point is that there are several separate updates involved to accomplish this rather simple operation. Our bank's officers will want to be assured that either all these updates happen, or none of them happen. It would certainly not do for a system failure to result in Bob receiving $100.00 that was not debited from Alice. Nor would Alice long remain a happy customer if she was debited without Bob being credited. We need a guarantee that if something goes wrong partway through the operation, none of the steps executed so far will take effect. Grouping the updates into a transaction gives us this guarantee. A transaction is said to be atomic: from the point of view of other transactions, it either happens completely or not at all.
We also want a guarantee that once a transaction is completed and acknowledged by the database system, it has indeed been permanently recorded and won't be lost even if a crash ensues shortly thereafter. For example, if we are recording a cash withdrawal by Bob, we do not want any chance that the debit to his account will disappear in a crash just after he walks out the bank door. A transactional database guarantees that all the updates made by a transaction are logged in permanent storage (i.e., on disk) before the transaction is reported complete.
Another important property of transactional databases is closely related to the notion of atomic updates: when multiple transactions are running concurrently, each one should not be able to see the incomplete changes made by others. For example, if one transaction is busy totalling all the branch balances, it would not do for it to include the debit from Alice's branch but not the credit to Bob's branch, nor vice versa. So transactions must be all-or-nothing not only in terms of their permanent effect on the database, but also in terms of their visibility as they happen. The updates made so far by an open transaction are invisible to other transactions until the transaction completes, whereupon all the updates become visible simultaneously.
In PostgreSQL, a transaction is set up by surrounding the SQL commands of the transaction with BEGIN and COMMIT commands. So our banking transaction would actually look like:
BEGIN;
UPDATE accounts SET balance = balance - 100.00
    WHERE name = 'Alice';
-- etc etc
COMMIT;
If, partway through the transaction, we decide we do not want to commit (perhaps we just noticed that Alice's balance went negative), we can issue the command ROLLBACK instead of COMMIT, and all our updates so far will be canceled.
PostgreSQL actually treats every SQL statement as being executed within a transaction. If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT is sometimes called a transaction block.
Some client libraries issue BEGIN and COMMIT commands automatically, so that you might get the effect of transaction blocks without asking. Check the documentation for the interface you are using.
It's possible to control the statements in a transaction in a more granular fashion through the use of savepoints. Savepoints allow you to selectively discard parts of the transaction, while committing the rest. After defining a savepoint with SAVEPOINT, you can if needed roll back to the savepoint with ROLLBACK TO. All the transaction's database changes between defining the savepoint and rolling back to it are discarded, but changes earlier than the savepoint are kept.
After rolling back to a savepoint, it continues to be defined, so you can roll back to it several times. Conversely, if you are sure you won't need to roll back to a particular savepoint again, it can be released, so the system can free some resources. Keep in mind that either releasing or rolling back to a savepoint will automatically release all savepoints that were defined after it.
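For instance, releasing a savepoint that is no longer needed might look like this (a minimal sketch reusing the accounts table from the banking example):
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE name = 'Alice';
SAVEPOINT my_savepoint;
UPDATE accounts SET balance = balance + 100.00 WHERE name = 'Bob';
-- Bob's credit looks fine, so the savepoint is no longer needed.
RELEASE SAVEPOINT my_savepoint;   -- my_savepoint can no longer be rolled back to
COMMIT;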
All this is happening within the transaction block, so none of it is visible to other database sessions. When and if you commit the transaction block, the committed actions become visible as a unit to other sessions, while the rolled-back actions never become visible at all.
Remembering the bank database, suppose we debit $100.00 from Alice's account, and credit Bob's account, only to find later that we should have credited Wally's account. We could do it using savepoints like this:
BEGIN;
UPDATE accounts SET balance = balance - 100.00
    WHERE name = 'Alice';
SAVEPOINT my_savepoint;
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Bob';
-- oops ... forget that and use Wally's account
ROLLBACK TO my_savepoint;
UPDATE accounts SET balance = balance + 100.00
    WHERE name = 'Wally';
COMMIT;
This example is, of course, oversimplified, but there's a lot of control possible in a transaction block through the use of savepoints. Moreover, ROLLBACK TO is the only way to regain control of a transaction block that was put in aborted state by the system due to an error, short of rolling it back completely and starting again.
A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
Here is an example that shows how to compare each employee's salary with the average salary in his or her department:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
  depname  | empno | salary |          avg
-----------+-------+--------+-----------------------
 develop   |    11 |   5200 | 5020.0000000000000000
 develop   |     7 |   4200 | 5020.0000000000000000
 develop   |     9 |   4500 | 5020.0000000000000000
 develop   |     8 |   6000 | 5020.0000000000000000
 develop   |    10 |   5200 | 5020.0000000000000000
 personnel |     5 |   3500 | 3700.0000000000000000
 personnel |     2 |   3900 | 3700.0000000000000000
 sales     |     3 |   4800 | 4866.6666666666666667
 sales     |     1 |   5000 | 4866.6666666666666667
 sales     |     4 |   4800 | 4866.6666666666666667
(10 rows)
The first three output columns come directly from the table empsalary, and there is one output row for each row in the table. The fourth column represents an average taken across all the table rows that have the same depname value as the current row. (This actually is the same function as the non-window avg aggregate, but the OVER clause causes it to be treated as a window function and computed across the window frame.)
A window function call always contains an OVER clause directly following the window function's name and argument(s). This is what syntactically distinguishes it from a normal function or non-window aggregate. The OVER clause determines exactly how the rows of the query are split up for processing by the window function.
The PARTITION BY clause within OVER divides the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s). For each row, the window function is computed across the rows that fall into the same partition as the current row.
You can also control the order in which rows are processed by window functions using ORDER BY within OVER. (The window ORDER BY does not even have to match the order in which the rows are output.) Here is an example:
SELECT depname, empno, salary, rank() OVER (PARTITION BY depname ORDER BY salary DESC) FROM empsalary;
  depname  | empno | salary | rank
-----------+-------+--------+------
 develop   |     8 |   6000 |    1
 develop   |    10 |   5200 |    2
 develop   |    11 |   5200 |    2
 develop   |     9 |   4500 |    4
 develop   |     7 |   4200 |    5
 personnel |     2 |   3900 |    1
 personnel |     5 |   3500 |    2
 sales     |     1 |   5000 |    1
 sales     |     4 |   4800 |    2
 sales     |     3 |   4800 |    2
(10 rows)
As shown here, the rank function produces a numerical rank for each distinct ORDER BY value in the current row's partition, using the order defined by the ORDER BY clause. rank needs no explicit parameter, because its behavior is entirely determined by the OVER clause.
The rows considered by a window function are those of the “virtual table” produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row removed because it does not meet the WHERE condition is not seen by any window function. A query can contain multiple window functions that slice up the data in different ways using different OVER clauses, but they all act on the same collection of rows defined by this virtual table.
We already saw that ORDER BY can be omitted if the ordering of rows is not important. It is also possible to omit PARTITION BY, in which case there is a single partition containing all rows.
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Some window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition. [5]
Here is an example using sum:
SELECT salary, sum(salary) OVER () FROM empsalary;
 salary |  sum
--------+-------
   5200 | 47100
   5000 | 47100
   3500 | 47100
   4800 | 47100
   3900 | 47100
   4200 | 47100
   4500 | 47100
   4800 | 47100
   6000 | 47100
   5200 | 47100
(10 rows)
Above, since there is no ORDER BY in the OVER clause, the window frame is the same as the partition, which for lack of PARTITION BY is the whole table; in other words each sum is taken over the whole table and so we get the same result for each output row. But if we add an ORDER BY clause, we get very different results:
SELECT salary, sum(salary) OVER (ORDER BY salary) FROM empsalary;
 salary |  sum
--------+-------
   3500 |  3500
   3900 |  7400
   4200 | 11600
   4500 | 16100
   4800 | 25700
   4800 | 25700
   5000 | 30700
   5200 | 41100
   5200 | 41100
   6000 | 47100
(10 rows)
Here the sum is taken from the first (lowest) salary up through the current one, including any duplicates of the current one (notice the results for the duplicated salaries).
Window functions are permitted only in the SELECT list and the ORDER BY clause of the query. They are forbidden elsewhere, such as in GROUP BY, HAVING and WHERE clauses. This is because they logically execute after the processing of those clauses. Also, window functions execute after non-window aggregate functions. This means it is valid to include an aggregate function call in the arguments of a window function, but not vice versa.
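As a sketch of an aggregate inside a window function's arguments (reusing the empsalary table; the column aliases are made up), the following computes each department's share of total payroll, where the window function sum(...) OVER () receives the per-department sum(salary) aggregate as its argument:
SELECT depname,
       sum(salary) AS dept_total,
       round(100.0 * sum(salary) / sum(sum(salary)) OVER (), 1) AS pct_of_payroll
FROM empsalary
GROUP BY depname;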
If there is a need to filter or group rows after the window calculations are performed, you can use a sub-select. For example:
SELECT depname, empno, salary, enroll_date
FROM
  (SELECT depname, empno, salary, enroll_date,
          rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
     FROM empsalary
  ) AS ss
WHERE pos < 3;
The above query only shows the rows from the inner query having rank less than 3.
When a query involves multiple window functions, it is possible to write out each one with a separate OVER clause, but this is duplicative and error-prone if the same windowing behavior is wanted for several functions. Instead, each windowing behavior can be named in a WINDOW clause and then referenced in OVER. For example:
SELECT sum(salary) OVER w, avg(salary) OVER w
  FROM empsalary
  WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);
More details about window functions can be found in Section 4.2.8, Section 9.22, Section 7.2.5, and the SELECT reference page.
Inheritance is a concept from object-oriented databases. It opens up interesting new possibilities of database design.
Let's create two tables: A table cities and a table capitals. Naturally, capitals are also cities, so you want some way to show the capitals implicitly when you list all cities. If you're really clever you might invent some scheme like this:
CREATE TABLE capitals (
  name       text,
  population real,
  elevation  int,    -- (in ft)
  state      char(2)
);

CREATE TABLE non_capitals (
  name       text,
  population real,
  elevation  int     -- (in ft)
);

CREATE VIEW cities AS
  SELECT name, population, elevation FROM capitals
    UNION
  SELECT name, population, elevation FROM non_capitals;
This works OK as far as querying goes, but it gets ugly when you need to update several rows, for one thing.
A better solution is this:
CREATE TABLE cities (
  name       text,
  population real,
  elevation  int     -- (in ft)
);

CREATE TABLE capitals (
  state      char(2) UNIQUE NOT NULL
) INHERITS (cities);
In this case, a row of capitals inherits all columns (name, population, and elevation) from its parent, cities. The type of the column name is text, a native PostgreSQL type for variable length character strings. The capitals table has an additional column, state, which shows its state abbreviation. In PostgreSQL, a table can inherit from zero or more other tables.
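The queries below assume some rows have been loaded. A hypothetical load consistent with the output shown (the elevations come from that output; the population figures are merely illustrative) could be:
-- Sample data; populations are placeholder values for illustration only.
INSERT INTO cities VALUES ('Las Vegas', 258000, 2174);
INSERT INTO cities VALUES ('Mariposa', 1200, 1953);
INSERT INTO capitals VALUES ('Madison', 191000, 845, 'WI');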
For example, the following query finds the names of all cities, including state capitals, that are located at an elevation over 500 feet:
SELECT name, elevation FROM cities WHERE elevation > 500;
which returns:
   name    | elevation
-----------+-----------
 Las Vegas |      2174
 Mariposa  |      1953
 Madison   |       845
(3 rows)
On the other hand, the following query finds all the cities that are not state capitals and are situated at an elevation over 500 feet:
SELECT name, elevation FROM ONLY cities WHERE elevation > 500;
   name    | elevation
-----------+-----------
 Las Vegas |      2174
 Mariposa  |      1953
(2 rows)
Here the ONLY before cities indicates that the query should be run over only the cities table, and not tables below cities in the inheritance hierarchy. Many of the commands that we have already discussed — SELECT, UPDATE, and DELETE — support this ONLY notation.
Although inheritance is frequently useful, it has not been integrated with unique constraints or foreign keys, which limits its usefulness. See Section 5.10 for more detail.
PostgreSQL has many features not touched upon in this tutorial introduction, which has been oriented toward newer users of SQL. These features are discussed in more detail in the remainder of this book.
If you feel you need more introductory material, please visit the PostgreSQL web site for links to more resources.
[5] There are options to define the window frame in other ways, but this tutorial does not cover them. See Section 4.2.8 for details.
This part describes the use of the SQL language in PostgreSQL. We start with describing the general syntax of SQL, then explain how to create the structures to hold data, how to populate the database, and how to query it. The middle part lists the available data types and functions for use in SQL commands. The rest treats several aspects that are important for tuning a database for optimal performance.
The information in this part is arranged so that a novice user can follow it start to end to gain a full understanding of the topics without having to refer forward too many times. The chapters are intended to be self-contained, so that advanced users can read the chapters individually as they choose. The information in this part is presented in a narrative fashion in topical units. Readers looking for a complete description of a particular command should see Part VI.
Readers of this part should know how to connect to a PostgreSQL database and issue SQL commands. Readers that are unfamiliar with these issues are encouraged to read Part I first. SQL commands are typically entered using the PostgreSQL interactive terminal psql, but other programs that have similar functionality can be used as well.
This chapter describes the syntax of SQL. It forms the foundation for understanding the following chapters which will go into detail about how SQL commands are applied to define and modify data.
We also advise users who are already familiar with SQL to read this chapter carefully because it contains several rules and concepts that are implemented inconsistently among SQL databases or that are specific to PostgreSQL.
SQL input consists of a sequence of commands. A command is composed of a sequence of tokens, terminated by a semicolon (“;”). The end of the input stream also terminates a command. Which tokens are valid depends on the syntax of the particular command.
A token can be a key word, an identifier, a quoted identifier, a literal (or constant), or a special character symbol. Tokens are normally separated by whitespace (space, tab, newline), but need not be if there is no ambiguity (which is generally only the case if a special character is adjacent to some other token type).
For example, the following is (syntactically) valid SQL input:
SELECT * FROM MY_TABLE;
UPDATE MY_TABLE SET A = 5;
INSERT INTO MY_TABLE VALUES (3, 'hi there');
This is a sequence of three commands, one per line (although this is not required; more than one command can be on a line, and commands can usefully be split across lines).
Additionally, comments can occur in SQL input. They are not tokens, they are effectively equivalent to whitespace.
The SQL syntax is not very consistent regarding what tokens identify commands and which are operands or parameters. The first few tokens are generally the command name, so in the above example we would usually speak of a “SELECT”, an “UPDATE”, and an “INSERT” command. But for instance the UPDATE command always requires a SET token to appear in a certain position, and this particular variation of INSERT also requires a VALUES in order to be complete. The precise syntax rules for each command are described in Part VI.
Tokens such as SELECT, UPDATE, or VALUES in the example above are examples of key words, that is, words that have a fixed meaning in the SQL language. The tokens MY_TABLE and A are examples of identifiers. They identify names of tables, columns, or other database objects, depending on the command they are used in. Therefore they are sometimes simply called “names”. Key words and identifiers have the same lexical structure, meaning that one cannot know whether a token is an identifier or a key word without knowing the language. A complete list of key words can be found in Appendix C.
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable. The SQL standard will not define a key word that contains digits or starts or ends with an underscore, so identifiers of this form are safe against possible conflict with future extensions of the standard.
The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.
Key words and unquoted identifiers are case insensitive. Therefore:
UPDATE MY_TABLE SET A = 5;
can equivalently be written as:
uPDaTE my_TabLE SeT a = 5;
A convention often used is to write key words in upper case and names in lower case, e.g.:
UPDATE my_table SET a = 5;
There is a second kind of identifier: the delimited identifier or quoted identifier. It is formed by enclosing an arbitrary sequence of characters in double-quotes ("). A delimited identifier is always an identifier, never a key word. So "select" could be used to refer to a column or table named “select”, whereas an unquoted select would be taken as a key word and would therefore provoke a parse error when used where a table or column name is expected. The example can be written with quoted identifiers like this:
UPDATE "my_table" SET "a" = 5;
Quoted identifiers can contain any character, except the character with code zero. (To include a double quote, write two double quotes.) This allows constructing table or column names that would otherwise not be possible, such as ones containing spaces or ampersands. The length limitation still applies.
Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but "Foo" and "FOO" are different from these three and each other. (The folding of unquoted names to lower case in PostgreSQL is incompatible with the SQL standard, which says that unquoted names should be folded to upper case. Thus, foo should be equivalent to "FOO" not "foo" according to the standard. If you want to write portable applications you are advised to always quote a particular name or never quote it.)
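A minimal sketch of the folding rules (the table and column names are made up for illustration):
CREATE TABLE "MixedCase" ("Id" int);

SELECT "Id" FROM "MixedCase";   -- works: the quoted names match exactly
SELECT Id FROM MixedCase;       -- fails: the unquoted names fold to id and mixedcase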
A variant of quoted identifiers allows including escaped Unicode characters identified by their code points. This variant starts with U& (upper or lower case U followed by ampersand) immediately before the opening double quote, without any spaces in between, for example U&"foo". (Note that this creates an ambiguity with the operator &. Use spaces around the operator to avoid this problem.) Inside the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by the four-digit hexadecimal code point number or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point number. For example, the identifier "data" could be written as
U&"d\0061t\+000061"
The following less trivial example writes the Russian word “slon” (elephant) in Cyrillic letters:
U&"\0441\043B\043E\043D"
If a different escape character than backslash is desired, it can be specified using the UESCAPE clause after the string, for example:
U&"d!0061t!+000061" UESCAPE '!'
The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character. Note that the escape character is written in single quotes, not double quotes, after UESCAPE.
To include the escape character in the identifier literally, write it twice.
Either the 4-digit or the 6-digit escape form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnecessary. (Surrogate pairs are not stored directly, but are combined into a single code point.)
If the server encoding is not UTF-8, the Unicode code point identified by one of these escape sequences is converted to the actual server encoding; an error is reported if that's not possible.
There are three kinds of implicitly-typed constants in PostgreSQL: strings, bit strings, and numbers. Constants can also be specified with explicit types, which can enable more accurate representation and more efficient handling by the system. These alternatives are discussed in the following subsections.
A string constant in SQL is an arbitrary sequence of characters bounded by single quotes ('), for example 'This is a string'. To include a single-quote character within a string constant, write two adjacent single quotes, e.g., 'Dianne''s horse'. Note that this is not the same as a double-quote character (").
Two string constants that are only separated by whitespace with at least one newline are concatenated and effectively treated as if the string had been written as one constant. For example:
SELECT 'foo'
'bar';
is equivalent to:
SELECT 'foobar';
but:
SELECT 'foo' 'bar';
is not valid syntax. (This slightly bizarre behavior is specified by SQL; PostgreSQL is following the standard.)
PostgreSQL also accepts “escape” string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter E (upper or lower case) just before the opening single quote, e.g., E'foo'. (When continuing an escape string constant across lines, write E only before the first opening quote.)
Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represent a special byte value, as shown in Table 4.1.
Table 4.1. Backslash Escape Sequences
Backslash Escape Sequence | Interpretation |
---|---|
\b | backspace |
\f | form feed |
\n | newline |
\r | carriage return |
\t | tab |
\o, \oo, \ooo (o = 0–7) | octal byte value |
\xh, \xhh (h = 0–9, A–F) | hexadecimal byte value |
\uxxxx, \Uxxxxxxxx (x = 0–9, A–F) | 16 or 32-bit hexadecimal Unicode character value |
Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (\\). Also, a single quote can be included in an escape string by writing \', in addition to the normal way of ''.
It is your responsibility that the byte sequences you create, especially when using the octal or hexadecimal escapes, compose valid characters in the server character set encoding. A useful alternative is to use Unicode escapes or the alternative Unicode escape syntax, explained in Section 4.1.2.3; then the server will check that the character conversion is possible.
If the configuration parameter standard_conforming_strings is off, then PostgreSQL recognizes backslash escapes in both regular and escape string constants. However, as of PostgreSQL 9.1, the default is on, meaning that backslash escapes are recognized only in escape string constants. This behavior is more standards-compliant, but might break applications which rely on the historical behavior, where backslash escapes were always recognized. As a workaround, you can set this parameter to off, but it is better to migrate away from using backslash escapes. If you need to use a backslash escape to represent a special character, write the string constant with an E.
In addition to standard_conforming_strings, the configuration parameters escape_string_warning and backslash_quote govern treatment of backslashes in string constants.
The character with the code zero cannot be in a string constant.
PostgreSQL also supports another type of escape syntax for strings that allows specifying arbitrary Unicode characters by code point. A Unicode escape string constant starts with U& (upper or lower case letter U followed by ampersand) immediately before the opening quote, without any spaces in between, for example U&'foo'. (Note that this creates an ambiguity with the operator &. Use spaces around the operator to avoid this problem.) Inside the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by the four-digit hexadecimal code point number or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point number. For example, the string 'data' could be written as
U&'d\0061t\+000061'
The following less trivial example writes the Russian word “slon” (elephant) in Cyrillic letters:
U&'\0441\043B\043E\043D'
If a different escape character than backslash is desired, it can be specified using the UESCAPE clause after the string, for example:
U&'d!0061t!+000061' UESCAPE '!'
The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character.
To include the escape character in the string literally, write it twice.
Either the 4-digit or the 6-digit escape form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnecessary. (Surrogate pairs are not stored directly, but are combined into a single code point.)
If the server encoding is not UTF-8, the Unicode code point identified by one of these escape sequences is converted to the actual server encoding; an error is reported if that's not possible.
Also, the Unicode escape syntax for string constants only works when the configuration parameter standard_conforming_strings is turned on. This is because otherwise this syntax could confuse clients that parse the SQL statements to the point that it could lead to SQL injections and similar security issues. If the parameter is set to off, this syntax will be rejected with an error message.
While the standard syntax for specifying string constants is usually convenient, it can be difficult to understand when the desired string contains many single quotes, since each of those must be doubled. To allow more readable queries in such situations, PostgreSQL provides another way, called “dollar quoting”, to write string constants.
A dollar-quoted string constant consists of a dollar sign ($), an optional “tag” of zero or more characters, another dollar sign, an arbitrary sequence of characters that makes up the string content, a dollar sign, the same tag that began this dollar quote, and a dollar sign. For example, here are two different ways to specify the string “Dianne's horse” using dollar quoting:
$$Dianne's horse$$
$SomeTag$Dianne's horse$SomeTag$
Notice that inside the dollar-quoted string, single quotes can be used without needing to be escaped. Indeed, no characters inside a dollar-quoted string are ever escaped: the string content is always written literally. Backslashes are not special, and neither are dollar signs, unless they are part of a sequence matching the opening tag.
It is possible to nest dollar-quoted string constants by choosing different tags at each nesting level. This is most commonly used in writing function definitions. For example:
$function$
BEGIN
    RETURN ($1 ~ $q$[\t\r\n\v\\]$q$);
END;
$function$
Here, the sequence $q$[\t\r\n\v\\]$q$ represents a dollar-quoted literal string [\t\r\n\v\\], which will be recognized when the function body is executed by PostgreSQL. But since the sequence does not match the outer dollar quoting delimiter $function$, it is just some more characters within the constant so far as the outer string is concerned.
The tag, if any, of a dollar-quoted string follows the same rules as an unquoted identifier, except that it cannot contain a dollar sign. Tags are case sensitive, so $tag$String content$tag$ is correct, but $TAG$String content$tag$ is not.
A dollar-quoted string that follows a keyword or identifier must be separated from it by whitespace; otherwise the dollar quoting delimiter would be taken as part of the preceding identifier.
Dollar quoting is not part of the SQL standard, but it is often a more convenient way to write complicated string literals than the standard-compliant single quote syntax. It is particularly useful when representing string constants inside other constants, as is often needed in procedural function definitions. With single-quote syntax, each backslash in the above example would have to be written as four backslashes, which would be reduced to two backslashes in parsing the original string constant, and then to one when the inner string constant is re-parsed during function execution.
Bit-string constants look like regular string constants with a B (upper or lower case) immediately before the opening quote (no intervening whitespace), e.g., B'1001'. The only characters allowed within bit-string constants are 0 and 1.
Alternatively, bit-string constants can be specified in hexadecimal notation, using a leading X (upper or lower case), e.g., X'1FF'. This notation is equivalent to a bit-string constant with four binary digits for each hexadecimal digit.
Both forms of bit-string constant can be continued across lines in the same way as regular string constants. Dollar quoting cannot be used in a bit-string constant.
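A quick illustration of the two notations (the values and aliases are made up):
SELECT B'1001' AS four_bits,
       X'1FF'  AS twelve_bits;   -- X'1FF' is the same as B'000111111111'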
Numeric constants are accepted in these general forms:
digits
digits.[digits][e[+-]digits]
[digits].digits[e[+-]digits]
digitse[+-]digits
where digits is one or more decimal digits (0 through 9). At least one digit must be before or after the decimal point, if one is used. At least one digit must follow the exponent marker (e), if one is present. There cannot be any spaces or other characters embedded in the constant. Note that any leading plus or minus sign is not actually considered part of the constant; it is an operator applied to the constant.
These are some examples of valid numeric constants:
42
3.5
4.
.001
5e2
1.925e-3
A numeric constant that contains neither a decimal point nor an exponent is initially presumed to be type integer if its value fits in type integer (32 bits); otherwise it is presumed to be type bigint if its value fits in type bigint (64 bits); otherwise it is taken to be type numeric. Constants that contain decimal points and/or exponents are always initially presumed to be type numeric.
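One way to observe these initial assignments is with the pg_typeof function (the particular constants are arbitrary examples):
SELECT pg_typeof(42), pg_typeof(3000000000), pg_typeof(3.5), pg_typeof(5e2);
-- integer, bigint, numeric, numeric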
The initially assigned data type of a numeric constant is just a starting point for the type resolution algorithms. In most cases the constant will be automatically coerced to the most appropriate type depending on context. When necessary, you can force a numeric value to be interpreted as a specific data type by casting it. For example, you can force a numeric value to be treated as type real (float4) by writing:
REAL '1.23'  -- string style
1.23::REAL   -- PostgreSQL (historical) style
These are actually just special cases of the general casting notations discussed next.
A constant of an arbitrary type can be entered using any one of the following notations:
type 'string'
'string'::type
CAST ( 'string' AS type )
The string constant's text is passed to the input conversion routine for the type called type. The result is a constant of the indicated type. The explicit type cast can be omitted if there is no ambiguity as to the type the constant must be (for example, when it is assigned directly to a table column), in which case it is automatically coerced.
The string constant can be written using either regular SQL notation or dollar-quoting.
It is also possible to specify a type coercion using a function-like syntax:
typename ( 'string' )
but not all type names can be used in this way; see Section 4.2.9 for details.
The ::, CAST(), and function-call syntaxes can also be used to specify run-time type conversions of arbitrary expressions, as discussed in Section 4.2.9. To avoid syntactic ambiguity, the type 'string' syntax can only be used to specify the type of a simple literal constant. Another restriction on the type 'string' syntax is that it does not work for array types; use :: or CAST() to specify the type of an array constant.
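For example, an array constant can be typed like this (arbitrary sample values and aliases):
SELECT '{1,2,3}'::integer[]      AS a;
SELECT CAST('{10,20}' AS text[]) AS b;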
The CAST() syntax conforms to SQL. The type 'string' syntax is a generalization of the standard: SQL specifies this syntax only for a few data types, but PostgreSQL allows it for all types. The syntax with :: is historical PostgreSQL usage, as is the function-call syntax.
An operator name is a sequence of up to NAMEDATALEN-1 (63 by default) characters from the following list:
+ - * / < > = ~ ! @ # % ^ & | ` ?
There are a few restrictions on operator names, however:
-- and /* cannot appear anywhere in an operator name, since they will be taken as the start of a comment.
A multiple-character operator name cannot end in + or -, unless the name also contains at least one of these characters: ~ ! @ # % ^ & | ` ?
For example, @- is an allowed operator name, but *- is not. This restriction allows PostgreSQL to parse SQL-compliant queries without requiring spaces between tokens.
When working with non-SQL-standard operator names, you will usually need to separate adjacent operators with spaces to avoid ambiguity. For example, if you have defined a prefix operator named @, you cannot write X*@Y; you must write X* @Y to ensure that PostgreSQL reads it as two operator names not one.
Some characters that are not alphanumeric have a special meaning that is different from being an operator. Details on the usage can be found at the location where the respective syntax element is described. This section only exists to advise the existence and summarize the purposes of these characters.
A dollar sign ($) followed by digits is used to represent a positional parameter in the body of a function definition or a prepared statement. In other contexts the dollar sign can be part of an identifier or a dollar-quoted string constant.
Parentheses (()) have their usual meaning to group expressions and enforce precedence. In some cases parentheses are required as part of the fixed syntax of a particular SQL command.
Brackets ([]) are used to select the elements of an array. See Section 8.15 for more information on arrays.
Commas (,) are used in some syntactical constructs to separate the elements of a list.
The semicolon (;) terminates an SQL command. It cannot appear anywhere within a command, except within a string constant or quoted identifier.
The colon (:) is used to select “slices” from arrays. (See Section 8.15.) In certain SQL dialects (such as Embedded SQL), the colon is used to prefix variable names.
The asterisk (*) is used in some contexts to denote all the fields of a table row or composite value. It also has a special meaning when used as the argument of an aggregate function, namely that the aggregate does not require any explicit parameter.
The period (.) is used in numeric constants, and to separate schema, table, and column names.
A comment is a sequence of characters beginning with double dashes and extending to the end of the line, e.g.:
-- This is a standard SQL comment
Alternatively, C-style block comments can be used:
/* multiline comment
 * with nesting: /* nested block comment */
 */
where the comment begins with /* and extends to the matching occurrence of */. These block comments nest, as specified in the SQL standard but unlike C, so that one can comment out larger blocks of code that might contain existing block comments.
A comment is removed from the input stream before further syntax analysis and is effectively replaced by whitespace.
Table 4.2 shows the precedence and associativity of the operators in PostgreSQL. Most operators have the same precedence and are left-associative. The precedence and associativity of the operators is hard-wired into the parser. Add parentheses if you want an expression with multiple operators to be parsed in some other way than what the precedence rules imply.
Table 4.2. Operator Precedence (highest to lowest)
Operator/Element | Associativity | Description |
---|---|---|
. | left | table/column name separator |
:: | left | PostgreSQL-style typecast |
[ ] | left | array element selection |
+ - | right | unary plus, unary minus |
COLLATE | left | collation selection |
AT | left | AT TIME ZONE |
^ | left | exponentiation |
* / % | left | multiplication, division, modulo |
+ - | left | addition, subtraction |
(any other operator) | left | all other native and user-defined operators |
BETWEEN IN LIKE ILIKE SIMILAR |  | range containment, set membership, string matching |
< > = <= >= <> |  | comparison operators |
IS ISNULL NOTNULL |  | IS TRUE, IS FALSE, IS NULL, IS DISTINCT FROM, etc |
NOT | right | logical negation |
AND | left | logical conjunction |
OR | left | logical disjunction |
Note that the operator precedence rules also apply to user-defined operators that have the same names as the built-in operators mentioned above. For example, if you define a “+” operator for some custom data type it will have the same precedence as the built-in “+” operator, no matter what yours does.
When a schema-qualified operator name is used in the OPERATOR syntax, as for example in:
SELECT 3 OPERATOR(pg_catalog.+) 4;
the OPERATOR construct is taken to have the default precedence shown in Table 4.2 for “any other operator”. This is true no matter which specific operator appears inside OPERATOR().
PostgreSQL versions before 9.5 used slightly different operator precedence rules. In particular, <= >= and <> used to be treated as generic operators; IS tests used to have higher priority; and NOT BETWEEN and related constructs acted inconsistently, being taken in some cases as having the precedence of NOT rather than BETWEEN. These rules were changed for better compliance with the SQL standard and to reduce confusion from inconsistent treatment of logically equivalent constructs. In most cases, these changes will result in no behavioral change, or perhaps in “no such operator” failures which can be resolved by adding parentheses. However there are corner cases in which a query might change behavior without any parsing error being reported.
Value expressions are used in a variety of contexts, such as in the target list of the SELECT command, as new column values in INSERT or UPDATE, or in search conditions in a number of commands. The result of a value expression is sometimes called a scalar, to distinguish it from the result of a table expression (which is a table). Value expressions are therefore also called scalar expressions (or even simply expressions). The expression syntax allows the calculation of values from primitive parts using arithmetic, logical, set, and other operations.
A value expression is one of the following:
A constant or literal value
A column reference
A positional parameter reference, in the body of a function definition or prepared statement
A subscripted expression
A field selection expression
An operator invocation
A function call
An aggregate expression
A window function call
A type cast
A collation expression
A scalar subquery
An array constructor
A row constructor
Another value expression in parentheses (used to group subexpressions and override precedence)
In addition to this list, there are a number of constructs that can be classified as an expression but do not follow any general syntax rules. These generally have the semantics of a function or operator and are explained in the appropriate location in Chapter 9. An example is the IS NULL clause.
We have already discussed constants in Section 4.1.2. The following sections discuss the remaining options.
A column can be referenced in the form:
correlation.columnname
correlation is the name of a table (possibly qualified with a schema name), or an alias for a table defined by means of a FROM clause. The correlation name and separating dot can be omitted if the column name is unique across all the tables being used in the current query. (See also Chapter 7.)
A positional parameter reference is used to indicate a value that is supplied externally to an SQL statement. Parameters are used in SQL function definitions and in prepared queries. Some client libraries also support specifying data values separately from the SQL command string, in which case parameters are used to refer to the out-of-line data values. The form of a parameter reference is:
$number
For example, consider the definition of a function, dept, as:
CREATE FUNCTION dept(text) RETURNS dept
    AS $$ SELECT * FROM dept WHERE name = $1 $$
    LANGUAGE SQL;
Here the $1 references the value of the first function argument whenever the function is invoked.
If an expression yields a value of an array type, then a specific element of the array value can be extracted by writing
expression[subscript]
or multiple adjacent elements (an “array slice”) can be extracted by writing
expression[lower_subscript:upper_subscript]
(Here, the brackets [ ] are meant to appear literally.) Each subscript is itself an expression, which will be rounded to the nearest integer value.
In general the array expression must be parenthesized, but the parentheses can be omitted when the expression to be subscripted is just a column reference or positional parameter. Also, multiple subscripts can be concatenated when the original array is multidimensional. For example:
mytable.arraycolumn[4]
mytable.two_d_column[17][34]
$1[10:42]
(arrayfunction(a,b))[42]
The parentheses in the last example are required. See Section 8.15 for more about arrays.
If an expression yields a value of a composite type (row type), then a specific field of the row can be extracted by writing
expression.fieldname
In general the row expression must be parenthesized, but the parentheses can be omitted when the expression to be selected from is just a table reference or positional parameter. For example:
mytable.mycolumn
$1.somecolumn
(rowfunction(a,b)).col3
(Thus, a qualified column reference is actually just a special case of the field selection syntax.) An important special case is extracting a field from a table column that is of a composite type:
(compositecol).somefield
(mytable.compositecol).somefield
The parentheses are required here to show that compositecol is a column name not a table name, or that mytable is a table name not a schema name in the second case.
You can ask for all fields of a composite value by writing .*:
(compositecol).*
This notation behaves differently depending on context; see Section 8.16.5 for details.
There are two possible syntaxes for an operator invocation:
expression operator expression (binary infix operator)
operator expression (unary prefix operator)
where the operator token follows the syntax rules of Section 4.1.3, or is one of the key words AND, OR, and NOT, or is a qualified operator name in the form:
OPERATOR(schema.operatorname)
Which particular operators exist and whether they are unary or binary depends on what operators have been defined by the system or the user. Chapter 9 describes the built-in operators.
The syntax for a function call is the name of a function (possibly qualified with a schema name), followed by its argument list enclosed in parentheses:
function_name ([expression [, expression ... ]] )
For example, the following computes the square root of 2:
sqrt(2)
The list of built-in functions is in Chapter 9. Other functions can be added by the user.
When issuing queries in a database where some users mistrust other users, observe security precautions from Section 10.3 when writing function calls.
The arguments can optionally have names attached. See Section 4.3 for details.
A function that takes a single argument of composite type can optionally be called using field-selection syntax, and conversely field selection can be written in functional style. That is, the notations col(table) and table.col are interchangeable. This behavior is not SQL-standard but is provided in PostgreSQL because it allows use of functions to emulate “computed fields”. For more information see Section 8.16.5.
An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following:
aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ]
aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]
where aggregate_name is a previously defined aggregate (possibly qualified with a schema name) and expression is any value expression that does not itself contain an aggregate expression or a window function call. The optional order_by_clause and filter_clause are described below.
The first form of aggregate expression invokes the aggregate once for each input row. The second form is the same as the first, since ALL is the default. The third form invokes the aggregate once for each distinct value of the expression (or distinct set of values, for multiple expressions) found in the input rows. The fourth form invokes the aggregate once for each input row; since no particular input value is specified, it is generally only useful for the count(*) aggregate function. The last form is used with ordered-set aggregate functions, which are described below.
Most aggregate functions ignore null inputs, so that rows in which one or more of the expression(s) yield null are discarded. This can be assumed to be true, unless otherwise specified, for all built-in aggregates.
For example, count(*) yields the total number of input rows; count(f1) yields the number of input rows in which f1 is non-null, since count ignores nulls; and count(distinct f1) yields the number of distinct non-null values of f1.
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. In many cases this does not matter; for example, min produces the same result no matter what order it receives the inputs in. However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering. The order_by_clause has the same syntax as for a query-level ORDER BY clause, as described in Section 7.5, except that its expressions are always just expressions and cannot be output-column names or numbers. For example:
SELECT array_agg(a ORDER BY b DESC) FROM table;
When dealing with multiple-argument aggregate functions, note that the ORDER BY clause goes after all the aggregate arguments. For example, write this:
SELECT string_agg(a, ',' ORDER BY a) FROM table;
not this:
SELECT string_agg(a ORDER BY a, ',') FROM table; -- incorrect
The latter is syntactically valid, but it represents a call of a single-argument aggregate function with two ORDER BY keys (the second one being rather useless since it's a constant).
If DISTINCT is specified in addition to an order_by_clause, then all the ORDER BY expressions must match regular arguments of the aggregate; that is, you cannot sort on an expression that is not included in the DISTINCT list.
The ability to specify both DISTINCT and ORDER BY in an aggregate function is a PostgreSQL extension.
Placing ORDER BY within the aggregate's regular argument list, as described so far, is used when ordering the input rows for general-purpose and statistical aggregates, for which ordering is optional. There is a subclass of aggregate functions called ordered-set aggregates for which an order_by_clause is required, usually because the aggregate's computation is only sensible in terms of a specific ordering of its input rows. Typical examples of ordered-set aggregates include rank and percentile calculations. For an ordered-set aggregate, the order_by_clause is written inside WITHIN GROUP (...), as shown in the final syntax alternative above. The expressions in the order_by_clause are evaluated once per input row just like regular aggregate arguments, sorted as per the order_by_clause's requirements, and fed to the aggregate function as input arguments. (This is unlike the case for a non-WITHIN GROUP order_by_clause, which is not treated as argument(s) to the aggregate function.) The argument expressions preceding WITHIN GROUP, if any, are called direct arguments to distinguish them from the aggregated arguments listed in the order_by_clause. Unlike regular aggregate arguments, direct arguments are evaluated only once per aggregate call, not once per input row. This means that they can contain variables only if those variables are grouped by GROUP BY; this restriction is the same as if the direct arguments were not inside an aggregate expression at all. Direct arguments are typically used for things like percentile fractions, which only make sense as a single value per aggregation calculation. The direct argument list can be empty; in this case, write just () not (*). (PostgreSQL will actually accept either spelling, but only the first way conforms to the SQL standard.)
An example of an ordered-set aggregate call is:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY income) FROM households;
 percentile_cont
-----------------
           50489
which obtains the 50th percentile, or median, value of the income column from table households. Here, 0.5 is a direct argument; it would make no sense for the percentile fraction to be a value varying across rows.
If FILTER is specified, then only the input rows for which the filter_clause evaluates to true are fed to the aggregate function; other rows are discarded. For example:
SELECT count(*) AS unfiltered,
       count(*) FILTER (WHERE i < 5) AS filtered
FROM generate_series(1,10) AS s(i);
 unfiltered | filtered
------------+----------
         10 |        4
(1 row)
The predefined aggregate functions are described in Section 9.21. Other aggregate functions can be added by the user.
An aggregate expression can only appear in the result list or HAVING clause of a SELECT command. It is forbidden in other clauses, such as WHERE, because those clauses are logically evaluated before the results of aggregates are formed.
When an aggregate expression appears in a subquery (see Section 4.2.11 and Section 9.23), the aggregate is normally evaluated over the rows of the subquery. But an exception occurs if the aggregate's arguments (and filter_clause if any) contain only outer-level variables: the aggregate then belongs to the nearest such outer level, and is evaluated over the rows of that query. The aggregate expression as a whole is then an outer reference for the subquery it appears in, and acts as a constant over any one evaluation of that subquery. The restriction about appearing only in the result list or HAVING clause applies with respect to the query level that the aggregate belongs to.
A window function call represents the application of an aggregate-like function over some portion of the rows selected by a query. Unlike non-window aggregate calls, this is not tied to grouping of the selected rows into a single output row — each row remains separate in the query output. However the window function has access to all the rows that would be part of the current row's group according to the grouping specification (PARTITION BY list) of the window function call.
The syntax of a window function call is one of the following:
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
where window_definition has the syntax
[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]
The optional frame_clause can be one of
{ RANGE | ROWS | GROUPS } frame_start [ frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end [ frame_exclusion ]
where frame_start and frame_end can be one of
UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING
and frame_exclusion can be one of
EXCLUDE CURRENT ROW
EXCLUDE GROUP
EXCLUDE TIES
EXCLUDE NO OTHERS
Here, expression represents any value expression that does not itself contain window function calls. window_name is a reference to a named window specification defined in the query's WINDOW clause.
Alternatively, a full window_definition can be given within parentheses, using the same syntax as for defining a named window in the WINDOW clause; see the SELECT reference page for details. It's worth pointing out that OVER wname is not exactly equivalent to OVER (wname ...); the latter implies copying and modifying the window definition, and will be rejected if the referenced window specification includes a frame clause.
The PARTITION BY clause groups the rows of the query into partitions, which are processed separately by the window function. PARTITION BY works similarly to a query-level GROUP BY clause, except that its expressions are always just expressions and cannot be output-column names or numbers. Without PARTITION BY, all rows produced by the query are treated as a single partition.
The ORDER BY clause determines the order in which the rows of a partition are processed by the window function. It works similarly to a query-level ORDER BY clause, but likewise cannot use output-column names or numbers. Without ORDER BY, rows are processed in an unspecified order.
The frame_clause specifies the set of rows constituting the window frame, which is a subset of the current partition, for those window functions that act on the frame instead of the whole partition. The set of rows in the frame can vary depending on which row is the current row. The frame can be specified in RANGE, ROWS or GROUPS mode; in each case, it runs from the frame_start to the frame_end. If frame_end is omitted, the end defaults to CURRENT ROW.
A frame_start of UNBOUNDED PRECEDING means that the frame starts with the first row of the partition, and similarly a frame_end of UNBOUNDED FOLLOWING means that the frame ends with the last row of the partition.
In RANGE or GROUPS mode, a frame_start of CURRENT ROW means the frame starts with the current row's first peer row (a row that the window's ORDER BY clause sorts as equivalent to the current row), while a frame_end of CURRENT ROW means the frame ends with the current row's last peer row. In ROWS mode, CURRENT ROW simply means the current row.
In the offset PRECEDING and offset FOLLOWING frame options, the offset must be an expression not containing any variables, aggregate functions, or window functions. The meaning of the offset depends on the frame mode:
In ROWS
mode,
the offset
must yield a non-null,
non-negative integer, and the option means that the frame starts or
ends the specified number of rows before or after the current row.
In GROUPS
mode,
the offset
again must yield a non-null,
non-negative integer, and the option means that the frame starts or
ends the specified number of peer groups
before or after the current row's peer group, where a peer group is a
set of rows that are equivalent in the ORDER BY
ordering. (There must be an ORDER BY
clause
in the window definition to use GROUPS
mode.)
In RANGE
mode, these options require that
the ORDER BY
clause specify exactly one column.
The offset
specifies the maximum
difference between the value of that column in the current row and
its value in preceding or following rows of the frame. The data type
of the offset
expression varies depending
on the data type of the ordering column. For numeric ordering
columns it is typically of the same type as the ordering column,
but for datetime ordering columns it is an interval
.
For example, if the ordering column is of type date
or timestamp
, one could write RANGE BETWEEN
'1 day' PRECEDING AND '10 days' FOLLOWING
.
The offset
is still required to be
non-null and non-negative, though the meaning
of “non-negative” depends on its data type.
In any case, the distance to the end of the frame is limited by the distance to the end of the partition, so that for rows near the partition ends the frame might contain fewer rows than elsewhere.
Notice that in both ROWS
and GROUPS
mode, 0 PRECEDING
and 0 FOLLOWING
are equivalent to CURRENT ROW
. This normally holds
in RANGE
mode as well, for an appropriate
data-type-specific meaning of “zero”.
The frame_exclusion
option allows rows around
the current row to be excluded from the frame, even if they would be
included according to the frame start and frame end options.
EXCLUDE CURRENT ROW
excludes the current row from the
frame.
EXCLUDE GROUP
excludes the current row and its
ordering peers from the frame.
EXCLUDE TIES
excludes any peers of the current
row from the frame, but not the current row itself.
EXCLUDE NO OTHERS
simply specifies explicitly the
default behavior of not excluding the current row or its peers.
The default framing option is RANGE UNBOUNDED PRECEDING
,
which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW
. With ORDER BY
, this sets the frame to be
all rows from the partition start up through the current row's last
ORDER BY
peer. Without ORDER BY
,
this means all rows of the partition are included in the window frame,
since all rows become peers of the current row.
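To make the framing rules concrete, here is a sketch contrasting the default frame with an explicit ROWS frame, assuming a hypothetical table t with an integer column x:
-- Default frame: running sum from the partition start through the current row's peers
SELECT x, sum(x) OVER (ORDER BY x) FROM t;
-- Explicit ROWS frame: moving sum over at most three rows centered on the current row
SELECT x, sum(x) OVER (ORDER BY x ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM t;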
Restrictions are that
frame_start
cannot be UNBOUNDED FOLLOWING
,
frame_end
cannot be UNBOUNDED PRECEDING
,
and the frame_end
choice cannot appear earlier in the
above list of frame_start
and frame_end
options than
the frame_start
choice does — for example
RANGE BETWEEN CURRENT ROW AND offset
PRECEDING is not allowed.
But, for example, ROWS BETWEEN 7 PRECEDING AND 8
PRECEDING is allowed, even though it would never select any
rows.
If FILTER
is specified, then only the input
rows for which the filter_clause
evaluates to true are fed to the window function; other rows
are discarded. Only window functions that are aggregates accept
a FILTER
clause.
The built-in window functions are described in Table 9.62. Other window functions can be added by the user. Also, any built-in or user-defined general-purpose or statistical aggregate can be used as a window function. (Ordered-set and hypothetical-set aggregates cannot presently be used as window functions.)
The syntaxes using *
are used for calling parameter-less
aggregate functions as window functions, for example
count(*) OVER (PARTITION BY x ORDER BY y)
.
The asterisk (*
) is customarily not used for
window-specific functions. Window-specific functions do not
allow DISTINCT
or ORDER BY
to be used within the
function argument list.
Window function calls are permitted only in the SELECT
list and the ORDER BY
clause of the query.
More information about window functions can be found in Section 3.5, Section 9.22, and Section 7.2.5.
A type cast specifies a conversion from one data type to another. PostgreSQL accepts two equivalent syntaxes for type casts:
CAST ( expression AS type )
expression::type
The CAST
syntax conforms to SQL; the syntax with
::
is historical PostgreSQL
usage.
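For example, these two expressions perform the same conversion of an integer value to type numeric:
SELECT CAST(42 AS numeric);
SELECT 42::numeric;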
When a cast is applied to a value expression of a known type, it represents a run-time type conversion. The cast will succeed only if a suitable type conversion operation has been defined. Notice that this is subtly different from the use of casts with constants, as shown in Section 4.1.2.7. A cast applied to an unadorned string literal represents the initial assignment of a type to a literal constant value, and so it will succeed for any type (if the contents of the string literal are acceptable input syntax for the data type).
An explicit type cast can usually be omitted if there is no ambiguity as to the type that a value expression must produce (for example, when it is assigned to a table column); the system will automatically apply a type cast in such cases. However, automatic casting is only done for casts that are marked “OK to apply implicitly” in the system catalogs. Other casts must be invoked with explicit casting syntax. This restriction is intended to prevent surprising conversions from being applied silently.
It is also possible to specify a type cast using a function-like syntax:
typename ( expression )
However, this only works for types whose names are also valid as
function names. For example, double precision
cannot be used this way, but the equivalent float8
can. Also, the names interval
, time
, and
timestamp
can only be used in this fashion if they are
double-quoted, because of syntactic conflicts. Therefore, the use of
the function-like cast syntax leads to inconsistencies and should
probably be avoided.
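As an illustration (a sketch, not an exhaustive rule), the first two statements below should be equivalent, because the conversion function to double precision is conventionally named float8, while the third is rejected because double precision is not syntactically a function name:
SELECT float8(42);              -- function-like cast syntax
SELECT 42::double precision;    -- equivalent standard cast
-- SELECT double precision(42); -- syntax error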
The function-like syntax is in fact just a function call. When one of the two standard cast syntaxes is used to do a run-time conversion, it will internally invoke a registered function to perform the conversion. By convention, these conversion functions have the same name as their output type, and thus the “function-like syntax” is nothing more than a direct invocation of the underlying conversion function. Obviously, this is not something that a portable application should rely on. For further details see CREATE CAST.
The COLLATE
clause overrides the collation of
an expression. It is appended to the expression it applies to:
expr COLLATE collation
where collation
is a possibly
schema-qualified identifier. The COLLATE
clause binds tighter than operators; parentheses can be used when
necessary.
If no collation is explicitly specified, the database system either derives a collation from the columns involved in the expression, or it defaults to the default collation of the database if no column is involved in the expression.
The two common uses of the COLLATE
clause are
overriding the sort order in an ORDER BY
clause, for
example:
SELECT a, b, c FROM tbl WHERE ... ORDER BY a COLLATE "C";
and overriding the collation of a function or operator call that has locale-sensitive results, for example:
SELECT * FROM tbl WHERE a > 'foo' COLLATE "C";
Note that in the latter case the COLLATE
clause is
attached to an input argument of the operator we wish to affect.
It doesn't matter which argument of the operator or function call the
COLLATE
clause is attached to, because the collation that is
applied by the operator or function is derived by considering all
arguments, and an explicit COLLATE
clause will override the
collations of all other arguments. (Attaching non-matching
COLLATE
clauses to more than one argument, however, is an
error. For more details see Section 24.2.)
Thus, this gives the same result as the previous example:
SELECT * FROM tbl WHERE a COLLATE "C" > 'foo';
But this is an error:
SELECT * FROM tbl WHERE (a > 'foo') COLLATE "C";
because it attempts to apply a collation to the result of the
>
operator, which is of the non-collatable data type
boolean
.
A scalar subquery is an ordinary
SELECT
query in parentheses that returns exactly one
row with one column. (See Chapter 7 for information about writing queries.)
The SELECT
query is executed
and the single returned value is used in the surrounding value expression.
It is an error to use a query that
returns more than one row or more than one column as a scalar subquery.
(But if, during a particular execution, the subquery returns no rows,
there is no error; the scalar result is taken to be null.)
The subquery can refer to variables from the surrounding query,
which will act as constants during any one evaluation of the subquery.
See also Section 9.23 for other expressions involving subqueries.
For example, the following finds the largest city population in each state:
SELECT name, (SELECT max(pop) FROM cities WHERE cities.state = states.name) FROM states;
An array constructor is an expression that builds an
array value using values for its member elements. A simple array
constructor
consists of the key word ARRAY
, a left square bracket
[
, a list of expressions (separated by commas) for the
array element values, and finally a right square bracket ]
.
For example:
SELECT ARRAY[1,2,3+4];
  array
---------
 {1,2,7}
(1 row)
By default,
the array element type is the common type of the member expressions,
determined using the same rules as for UNION
or
CASE
constructs (see Section 10.5).
You can override this by explicitly casting the array constructor to the
desired type, for example:
SELECT ARRAY[1,2,22.7]::integer[];
  array
----------
 {1,2,23}
(1 row)
This has the same effect as casting each expression to the array element type individually. For more on casting, see Section 4.2.9.
Multidimensional array values can be built by nesting array
constructors.
In the inner constructors, the key word ARRAY
can
be omitted. For example, these produce the same result:
SELECT ARRAY[ARRAY[1,2], ARRAY[3,4]];
     array
---------------
 {{1,2},{3,4}}
(1 row)

SELECT ARRAY[[1,2],[3,4]];
     array
---------------
 {{1,2},{3,4}}
(1 row)
Since multidimensional arrays must be rectangular, inner constructors
at the same level must produce sub-arrays of identical dimensions.
Any cast applied to the outer ARRAY
constructor propagates
automatically to all the inner constructors.
Multidimensional array constructor elements can be anything yielding
an array of the proper kind, not only a sub-ARRAY
construct.
For example:
CREATE TABLE arr(f1 int[], f2 int[]);

INSERT INTO arr VALUES (ARRAY[[1,2],[3,4]], ARRAY[[5,6],[7,8]]);

SELECT ARRAY[f1, f2, '{{9,10},{11,12}}'::int[]] FROM arr;
                     array
------------------------------------------------
 {{{1,2},{3,4}},{{5,6},{7,8}},{{9,10},{11,12}}}
(1 row)
You can construct an empty array, but since it's impossible to have an array with no type, you must explicitly cast your empty array to the desired type. For example:
SELECT ARRAY[]::integer[];
 array
-------
 {}
(1 row)
It is also possible to construct an array from the results of a
subquery. In this form, the array constructor is written with the
key word ARRAY
followed by a parenthesized (not
bracketed) subquery. For example:
SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
                              array
------------------------------------------------------------------
 {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412}
(1 row)

SELECT ARRAY(SELECT ARRAY[i, i*2] FROM generate_series(1,5) AS a(i));
              array
----------------------------------
 {{1,2},{2,4},{3,6},{4,8},{5,10}}
(1 row)
The subquery must return a single column. If the subquery's output column is of a non-array type, the resulting one-dimensional array will have an element for each row in the subquery result, with an element type matching that of the subquery's output column. If the subquery's output column is of an array type, the result will be an array of the same type but one higher dimension; in this case all the subquery rows must yield arrays of identical dimensionality, else the result would not be rectangular.
The subscripts of an array value built with ARRAY
always begin with one. For more information about arrays, see
Section 8.15.
A row constructor is an expression that builds a row value (also
called a composite value) using values
for its member fields. A row constructor consists of the key word
ROW
, a left parenthesis, zero or more
expressions (separated by commas) for the row field values, and finally
a right parenthesis. For example:
SELECT ROW(1,2.5,'this is a test');
The key word ROW
is optional when there is more than one
expression in the list.
A row constructor can include the syntax
rowvalue.*,
which will be expanded to a list of the elements of the row value,
just as occurs when the .*
syntax is used at the top level
of a SELECT
list (see Section 8.16.5).
For example, if table t
has
columns f1
and f2
, these are the same:
SELECT ROW(t.*, 42) FROM t;
SELECT ROW(t.f1, t.f2, 42) FROM t;
Before PostgreSQL 8.2, the
.*
syntax was not expanded in row constructors, so
that writing ROW(t.*, 42)
created a two-field row whose first
field was another row value. The new behavior is usually more useful.
If you need the old behavior of nested row values, write the inner
row value without .*
, for instance
ROW(t, 42)
.
By default, the value created by a ROW
expression is of
an anonymous record type. If necessary, it can be cast to a named
composite type — either the row type of a table, or a composite type
created with CREATE TYPE AS
. An explicit cast might be needed
to avoid ambiguity. For example:
CREATE TABLE mytable(f1 int, f2 float, f3 text);

CREATE FUNCTION getf1(mytable) RETURNS int AS 'SELECT $1.f1' LANGUAGE SQL;

-- No cast needed since only one getf1() exists
SELECT getf1(ROW(1,2.5,'this is a test'));
 getf1
-------
     1
(1 row)

CREATE TYPE myrowtype AS (f1 int, f2 text, f3 numeric);

CREATE FUNCTION getf1(myrowtype) RETURNS int AS 'SELECT $1.f1' LANGUAGE SQL;

-- Now we need a cast to indicate which function to call:
SELECT getf1(ROW(1,2.5,'this is a test'));
ERROR:  function getf1(record) is not unique

SELECT getf1(ROW(1,2.5,'this is a test')::mytable);
 getf1
-------
     1
(1 row)

SELECT getf1(CAST(ROW(11,'this is a test',2.5) AS myrowtype));
 getf1
-------
    11
(1 row)
Row constructors can be used to build composite values to be stored
in a composite-type table column, or to be passed to a function that
accepts a composite parameter. Also,
it is possible to compare two row values or test a row with
IS NULL
or IS NOT NULL
, for example:
SELECT ROW(1,2.5,'this is a test') = ROW(1, 3, 'not the same');

SELECT ROW(table.*) IS NULL FROM table;  -- detect all-null rows
For more detail see Section 9.24. Row constructors can also be used in connection with subqueries, as discussed in Section 9.23.
The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.
Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all. For instance, if one wrote:
SELECT true OR somefunc();
then somefunc()
would (probably) not be called
at all. The same would be the case if one wrote:
SELECT somefunc() OR true;
Note that this is not the same as the left-to-right “short-circuiting” of Boolean operators that is found in some programming languages.
As a consequence, it is unwise to use functions with side effects
as part of complex expressions. It is particularly dangerous to
rely on side effects or evaluation order in WHERE
and HAVING
clauses,
since those clauses are extensively reprocessed as part of
developing an execution plan. Boolean
expressions (AND
/OR
/NOT
combinations) in those clauses can be reorganized
in any manner allowed by the laws of Boolean algebra.
When it is essential to force evaluation order, a CASE
construct (see Section 9.18) can be
used. For example, this is an untrustworthy way of trying to
avoid division by zero in a WHERE
clause:
SELECT ... WHERE x > 0 AND y/x > 1.5;
But this is safe:
SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;
A CASE
construct used in this fashion will defeat optimization
attempts, so it should only be done when necessary. (In this particular
example, it would be better to sidestep the problem by writing
y > 1.5*x
instead.)
CASE
is not a cure-all for such issues, however.
One limitation of the technique illustrated above is that it does not
prevent early evaluation of constant subexpressions.
As described in Section 38.7, functions and
operators marked IMMUTABLE
can be evaluated when
the query is planned rather than when it is executed. Thus for example
SELECT CASE WHEN x > 0 THEN x ELSE 1/0 END FROM tab;
is likely to result in a division-by-zero failure due to the planner
trying to simplify the constant subexpression,
even if every row in the table has x > 0
so that the
ELSE
arm would never be entered at run time.
While that particular example might seem silly, related cases that don't
obviously involve constants can occur in queries executed within
functions, since the values of function arguments and local variables
can be inserted into queries as constants for planning purposes.
Within PL/pgSQL functions, for example, using an
IF
-THEN
-ELSE
statement to protect
a risky computation is much safer than just nesting it in a
CASE
expression.
Another limitation of the same kind is that a CASE
cannot
prevent evaluation of an aggregate expression contained within it,
because aggregate expressions are computed before other
expressions in a SELECT
list or HAVING
clause
are considered. For example, the following query can cause a
division-by-zero error despite seemingly having protected against it:
SELECT CASE WHEN min(employees) > 0 THEN avg(expenses / employees) END FROM departments;
The min()
and avg()
aggregates are computed
concurrently over all the input rows, so if any row
has employees
equal to zero, the division-by-zero error
will occur before there is any opportunity to test the result of
min()
. Instead, use a WHERE
or FILTER
clause to prevent problematic input rows from
reaching an aggregate function in the first place.
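For example, the problematic query above can be rewritten so that rows with zero employees never reach the division at all:
SELECT avg(expenses / employees) FILTER (WHERE employees > 0)
FROM departments;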
PostgreSQL allows functions that have named parameters to be called using either positional or named notation. Named notation is especially useful for functions that have a large number of parameters, since it makes the associations between parameters and actual arguments more explicit and reliable. In positional notation, a function call is written with its argument values in the same order as they are defined in the function declaration. In named notation, the arguments are matched to the function parameters by name and can be written in any order. For each notation, also consider the effect of function argument types, documented in Section 10.3.
In either notation, parameters that have default values given in the function declaration need not be written in the call at all. But this is particularly useful in named notation, since any combination of parameters can be omitted; while in positional notation parameters can only be omitted from right to left.
PostgreSQL also supports mixed notation, which combines positional and named notation. In this case, positional parameters are written first and named parameters appear after them.
The following examples will illustrate the usage of all three notations, using the following function definition:
CREATE FUNCTION concat_lower_or_upper(a text, b text, uppercase boolean DEFAULT false)
RETURNS text
AS
$$
 SELECT CASE
        WHEN $3 THEN UPPER($1 || ' ' || $2)
        ELSE LOWER($1 || ' ' || $2)
        END;
$$
LANGUAGE SQL IMMUTABLE STRICT;
Function concat_lower_or_upper
has two mandatory
parameters, a
and b
. Additionally
there is one optional parameter uppercase
which defaults
to false
. The a
and
b
inputs will be concatenated, and forced to either
upper or lower case depending on the uppercase
parameter. The remaining details of this function
definition are not important here (see Chapter 38 for
more information).
Positional notation is the traditional mechanism for passing arguments to functions in PostgreSQL. An example is:
SELECT concat_lower_or_upper('Hello', 'World', true);
 concat_lower_or_upper
-----------------------
 HELLO WORLD
(1 row)
All arguments are specified in order. The result is upper case since
uppercase
is specified as true
.
Another example is:
SELECT concat_lower_or_upper('Hello', 'World');
 concat_lower_or_upper
-----------------------
 hello world
(1 row)
Here, the uppercase
parameter is omitted, so it
receives its default value of false
, resulting in
lower case output. In positional notation, arguments can be omitted
from right to left so long as they have defaults.
In named notation, each argument's name is specified using
=>
to separate it from the argument expression.
For example:
SELECT concat_lower_or_upper(a => 'Hello', b => 'World');
 concat_lower_or_upper
-----------------------
 hello world
(1 row)
Again, the argument uppercase
was omitted
so it is set to false
implicitly. One advantage of
using named notation is that the arguments may be specified in any
order, for example:
SELECT concat_lower_or_upper(a => 'Hello', b => 'World', uppercase => true);
 concat_lower_or_upper
-----------------------
 HELLO WORLD
(1 row)

SELECT concat_lower_or_upper(a => 'Hello', uppercase => true, b => 'World');
 concat_lower_or_upper
-----------------------
 HELLO WORLD
(1 row)
An older syntax based on ":=" is supported for backward compatibility:
SELECT concat_lower_or_upper(a := 'Hello', uppercase := true, b := 'World');
 concat_lower_or_upper
-----------------------
 HELLO WORLD
(1 row)
The mixed notation combines positional and named notation. However, as already mentioned, named arguments cannot precede positional arguments. For example:
SELECT concat_lower_or_upper('Hello', 'World', uppercase => true);
 concat_lower_or_upper
-----------------------
 HELLO WORLD
(1 row)
In the above query, the arguments a
and
b
are specified positionally, while
uppercase
is specified by name. In this example,
that adds little except documentation. With a more complex function
having numerous parameters that have default values, named or mixed
notation can save a great deal of writing and reduce chances for error.
Named and mixed call notations currently cannot be used when calling an aggregate function (but they do work when an aggregate function is used as a window function).
This chapter covers how one creates the database structures that will hold one's data. In a relational database, the raw data is stored in tables, so the majority of this chapter is devoted to explaining how tables are created and modified and what features are available to control what data is stored in the tables. Subsequently, we discuss how tables can be organized into schemas, and how privileges can be assigned to tables. Finally, we will briefly look at other features that affect the data storage, such as inheritance, table partitioning, views, functions, and triggers.
A table in a relational database is much like a table on paper: It consists of rows and columns. The number and order of the columns is fixed, and each column has a name. The number of rows is variable — it reflects how much data is stored at a given moment. SQL does not make any guarantees about the order of the rows in a table. When a table is read, the rows will appear in an unspecified order, unless sorting is explicitly requested. This is covered in Chapter 7. Furthermore, SQL does not assign unique identifiers to rows, so it is possible to have several completely identical rows in a table. This is a consequence of the mathematical model that underlies SQL but is usually not desirable. Later in this chapter we will see how to deal with this issue.
Each column has a data type. The data type constrains the set of possible values that can be assigned to a column and assigns semantics to the data stored in the column so that it can be used for computations. For instance, a column declared to be of a numerical type will not accept arbitrary text strings, and the data stored in such a column can be used for mathematical computations. By contrast, a column declared to be of a character string type will accept almost any kind of data but it does not lend itself to mathematical calculations, although other operations such as string concatenation are available.
PostgreSQL includes a sizable set of
built-in data types that fit many applications. Users can also
define their own data types. Most built-in data types have obvious
names and semantics, so we defer a detailed explanation to Chapter 8. Some of the frequently used data types are
integer
for whole numbers, numeric
for
possibly fractional numbers, text
for character
strings, date
for dates, time
for
time-of-day values, and timestamp
for values
containing both date and time.
To create a table, you use the aptly named CREATE TABLE command. In this command you specify at least a name for the new table, the names of the columns and the data type of each column. For example:
CREATE TABLE my_first_table ( first_column text, second_column integer );
This creates a table named my_first_table
with
two columns. The first column is named
first_column
and has a data type of
text
; the second column has the name
second_column
and the type integer
.
The table and column names follow the identifier syntax explained
in Section 4.1.1. The type names are
usually also identifiers, but there are some exceptions. Note that the
column list is comma-separated and surrounded by parentheses.
Of course, the previous example was heavily contrived. Normally, you would give names to your tables and columns that convey what kind of data they store. So let's look at a more realistic example:
CREATE TABLE products ( product_no integer, name text, price numeric );
(The numeric
type can store fractional components, as
would be typical of monetary amounts.)
When you create many interrelated tables it is wise to choose a consistent naming pattern for the tables and columns. For instance, there is a choice of using singular or plural nouns for table names, both of which are favored by some theorist or other.
There is a limit on how many columns a table can contain. Depending on the column types, it is between 250 and 1600. However, defining a table with anywhere near this many columns is highly unusual and often a questionable design.
If you no longer need a table, you can remove it using the DROP TABLE command. For example:
DROP TABLE my_first_table;
DROP TABLE products;
Attempting to drop a table that does not exist is an error.
Nevertheless, it is common in SQL script files to unconditionally
try to drop each table before creating it, ignoring any error
messages, so that the script works whether or not the table exists.
(If you like, you can use the DROP TABLE IF EXISTS
variant
to avoid the error messages, but this is not standard SQL.)
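For example, this form succeeds (with only a notice) whether or not the table exists:
DROP TABLE IF EXISTS my_first_table;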
If you need to modify a table that already exists, see Section 5.6 later in this chapter.
With the tools discussed so far you can create fully functional tables. The remainder of this chapter is concerned with adding features to the table definition to ensure data integrity, security, or convenience. If you are eager to fill your tables with data now you can skip ahead to Chapter 6 and read the rest of this chapter later.
A column can be assigned a default value. When a new row is created and no values are specified for some of the columns, those columns will be filled with their respective default values. A data manipulation command can also request explicitly that a column be set to its default value, without having to know what that value is. (Details about data manipulation commands are in Chapter 6.)
If no default value is declared explicitly, the default value is the null value. This usually makes sense because a null value can be considered to represent unknown data.
In a table definition, default values are listed after the column data type. For example:
CREATE TABLE products (
product_no integer,
name text,
price numeric DEFAULT 9.99
);
The default value can be an expression, which will be
evaluated whenever the default value is inserted
(not when the table is created). A common example
is for a timestamp
column to have a default of CURRENT_TIMESTAMP
,
so that it gets set to the time of row insertion. Another common
example is generating a “serial number” for each row.
In PostgreSQL this is typically done by
something like:
CREATE TABLE products (
product_no integer DEFAULT nextval('products_product_no_seq'),
...
);
where the nextval()
function supplies successive values
from a sequence object (see Section 9.17). This arrangement is sufficiently common
that there's a special shorthand for it:
CREATE TABLE products (
product_no SERIAL,
...
);
The SERIAL
shorthand is discussed further in Section 8.1.4.
A generated column is a special column that is always computed from other columns. Thus, it is for columns what a view is for tables. There are two kinds of generated columns: stored and virtual. A stored generated column is computed when it is written (inserted or updated) and occupies storage as if it were a normal column. A virtual generated column occupies no storage and is computed when it is read. Thus, a virtual generated column is similar to a view and a stored generated column is similar to a materialized view (except that it is always updated automatically). PostgreSQL currently implements only stored generated columns.
To create a generated column, use the GENERATED ALWAYS
AS
clause in CREATE TABLE
, for example:
CREATE TABLE people (
...,
height_cm numeric,
height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);
The keyword STORED
must be specified to choose the
stored kind of generated column. See CREATE TABLE for
more details.
A generated column cannot be written to directly. In
INSERT
or UPDATE
commands, a value
cannot be specified for a generated column, but the keyword
DEFAULT
may be specified.
Consider the differences between a column with a default and a generated
column. The column default is evaluated once when the row is first
inserted if no other value was provided; a generated column is updated
whenever the row changes and cannot be overridden. A column default may
not refer to other columns of the table; a generation expression would
normally do so. A column default can use volatile functions, for example
random()
or functions referring to the current time;
this is not allowed for generated columns.
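A brief sketch of how the people table above behaves, assuming for simplicity that it contains only the two height columns shown: the generated column is computed on every write and cannot be assigned directly.
INSERT INTO people (height_cm) VALUES (180);

SELECT height_cm, height_in FROM people;   -- height_in is about 70.87, computed automatically

UPDATE people SET height_in = 72;          -- rejected: a generated column cannot be written to directly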
Several restrictions apply to the definition of generated columns and tables involving generated columns:
The generation expression can only use immutable functions and cannot use subqueries or reference anything other than the current row in any way.
A generation expression cannot reference another generated column.
A generation expression cannot reference a system column, except
tableoid
.
A generated column cannot have a column default or an identity definition.
A generated column cannot be part of a partition key.
Foreign tables can have generated columns. See CREATE FOREIGN TABLE for details.
For inheritance:
If a parent column is a generated column, a child column must also be
a generated column using the same expression. In the definition of
the child column, leave off the GENERATED
clause,
as it will be copied from the parent.
In case of multiple inheritance, if one parent column is a generated column, then all parent columns must be generated columns and with the same expression.
If a parent column is not a generated column, a child column may be defined to be a generated column or not.
Additional considerations apply to the use of generated columns.
Generated columns maintain access privileges separately from their underlying base columns. So, it is possible to arrange it so that a particular role can read from a generated column but not from the underlying base columns.
Generated columns are, conceptually, updated after
BEFORE
triggers have run. Therefore, changes made to
base columns in a BEFORE
trigger will be reflected in
generated columns. But conversely, it is not allowed to access
generated columns in BEFORE
triggers.
Generated columns are skipped for logical replication.
Data types are a way to limit the kind of data that can be stored in a table. For many applications, however, the constraint they provide is too coarse. For example, a column containing a product price should probably only accept positive values. But there is no standard data type that accepts only positive numbers. Another issue is that you might want to constrain column data with respect to other columns or rows. For example, in a table containing product information, there should be only one row for each product number.
To that end, SQL allows you to define constraints on columns and tables. Constraints give you as much control over the data in your tables as you wish. If a user attempts to store data in a column that would violate a constraint, an error is raised. This applies even if the value came from the default value definition.
A check constraint is the most generic constraint type. It allows you to specify that the value in a certain column must satisfy a Boolean (truth-value) expression. For instance, to require positive product prices, you could use:
CREATE TABLE products (
product_no integer,
name text,
price numeric CHECK (price > 0)
);
As you see, the constraint definition comes after the data type,
just like default value definitions. Default values and
constraints can be listed in any order. A check constraint
consists of the key word CHECK
followed by an
expression in parentheses. The check constraint expression should
involve the column thus constrained, otherwise the constraint
would not make too much sense.
You can also give the constraint a separate name. This clarifies error messages and allows you to refer to the constraint when you need to change it. The syntax is:
CREATE TABLE products (
product_no integer,
name text,
price numeric CONSTRAINT positive_price CHECK (price > 0)
);
So, to specify a named constraint, use the key word
CONSTRAINT
followed by an identifier followed
by the constraint definition. (If you don't specify a constraint
name in this way, the system chooses a name for you.)
A check constraint can also refer to several columns. Say you store a regular price and a discounted price, and you want to ensure that the discounted price is lower than the regular price:
CREATE TABLE products (
product_no integer,
name text,
price numeric CHECK (price > 0),
discounted_price numeric CHECK (discounted_price > 0),
CHECK (price > discounted_price)
);
The first two constraints should look familiar. The third one uses a new syntax. It is not attached to a particular column, instead it appears as a separate item in the comma-separated column list. Column definitions and these constraint definitions can be listed in mixed order.
We say that the first two constraints are column constraints, whereas the third one is a table constraint because it is written separately from any one column definition. Column constraints can also be written as table constraints, while the reverse is not necessarily possible, since a column constraint is supposed to refer to only the column it is attached to. (PostgreSQL doesn't enforce that rule, but you should follow it if you want your table definitions to work with other database systems.) The above example could also be written as:
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric,
    CHECK (price > 0),
    discounted_price numeric,
    CHECK (discounted_price > 0),
    CHECK (price > discounted_price)
);
or even:
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric CHECK (price > 0),
    discounted_price numeric,
    CHECK (discounted_price > 0 AND price > discounted_price)
);
It's a matter of taste.
Names can be assigned to table constraints in the same way as column constraints:
CREATE TABLE products (
product_no integer,
name text,
price numeric,
CHECK (price > 0),
discounted_price numeric,
CHECK (discounted_price > 0),
CONSTRAINT valid_discount CHECK (price > discounted_price)
);
It should be noted that a check constraint is satisfied if the check expression evaluates to true or the null value. Since most expressions will evaluate to the null value if any operand is null, they will not prevent null values in the constrained columns. To ensure that a column does not contain null values, the not-null constraint described in the next section can be used.
PostgreSQL does not support
CHECK
constraints that reference table data other than
the new or updated row being checked. While a CHECK
constraint that violates this rule may appear to work in simple
tests, it cannot guarantee that the database will not reach a state
in which the constraint condition is false (due to subsequent changes
of the other row(s) involved). This would cause a database dump and
restore to fail. The restore could fail even when the complete
database state is consistent with the constraint, due to rows not
being loaded in an order that will satisfy the constraint. If
possible, use UNIQUE
, EXCLUDE
,
or FOREIGN KEY
constraints to express
cross-row and cross-table restrictions.
If what you desire is a one-time check against other rows at row insertion, rather than a continuously-maintained consistency guarantee, a custom trigger can be used to implement that. (This approach avoids the dump/restore problem because pg_dump does not reinstall triggers until after restoring data, so that the check will not be enforced during a dump/restore.)
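For illustration, here is a minimal sketch of such a trigger, using hypothetical bookings and rooms tables (the check runs only when a booking row is inserted, not when rooms later change):
CREATE FUNCTION check_room_exists() RETURNS trigger AS $$
BEGIN
    -- one-time check at insert time; this is not a maintained constraint
    IF NOT EXISTS (SELECT 1 FROM rooms WHERE room_no = NEW.room_no) THEN
        RAISE EXCEPTION 'room % does not exist', NEW.room_no;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bookings_check_room
    BEFORE INSERT ON bookings
    FOR EACH ROW EXECUTE FUNCTION check_room_exists();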
PostgreSQL assumes that
CHECK
constraints' conditions are immutable, that
is, they will always give the same result for the same input row.
This assumption is what justifies examining CHECK
constraints only when rows are inserted or updated, and not at other
times. (The warning above about not referencing other table data is
really a special case of this restriction.)
An example of a common way to break this assumption is to reference a
user-defined function in a CHECK
expression, and
then change the behavior of that
function. PostgreSQL does not disallow
that, but it will not notice if there are rows in the table that now
violate the CHECK
constraint. That would cause a
subsequent database dump and restore to fail.
The recommended way to handle such a change is to drop the constraint
(using ALTER TABLE
), adjust the function definition,
and re-add the constraint, thereby rechecking it against all table rows.
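A sketch of that procedure, using a hypothetical constraint name price_is_valid and a hypothetical user-defined function valid_price():
ALTER TABLE products DROP CONSTRAINT price_is_valid;
-- adjust the definition of valid_price() here
ALTER TABLE products ADD CONSTRAINT price_is_valid CHECK (valid_price(price));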
A not-null constraint simply specifies that a column must not assume the null value. A syntax example:
CREATE TABLE products ( product_no integer NOT NULL, name text NOT NULL, price numeric );
A not-null constraint is always written as a column constraint. A
not-null constraint is functionally equivalent to creating a check
constraint CHECK (column_name IS NOT NULL), but in
PostgreSQL creating an explicit
not-null constraint is more efficient. The drawback is that you
cannot give explicit names to not-null constraints created this
way.
Of course, a column can have more than one constraint. Just write the constraints one after another:
CREATE TABLE products ( product_no integer NOT NULL, name text NOT NULL, price numeric NOT NULL CHECK (price > 0) );
The order doesn't matter. It does not necessarily determine in which order the constraints are checked.
The NOT NULL
constraint has an inverse: the
NULL
constraint. This does not mean that the
column must be null, which would surely be useless. Instead, this
simply selects the default behavior that the column might be null.
The NULL
constraint is not present in the SQL
standard and should not be used in portable applications. (It was
only added to PostgreSQL to be
compatible with some other database systems.) Some users, however,
like it because it makes it easy to toggle the constraint in a
script file. For example, you could start with:
CREATE TABLE products ( product_no integer NULL, name text NULL, price numeric NULL );
and then insert the NOT
key word where desired.
In most database designs the majority of columns should be marked not null.
Unique constraints ensure that the data contained in a column, or a group of columns, is unique among all the rows in the table. The syntax is:
CREATE TABLE products (
product_no integer UNIQUE,
name text,
price numeric
);
when written as a column constraint, and:
CREATE TABLE products (
product_no integer,
name text,
price numeric,
UNIQUE (product_no)
);
when written as a table constraint.
To define a unique constraint for a group of columns, write it as a table constraint with the column names separated by commas:
CREATE TABLE example (
a integer,
b integer,
c integer,
UNIQUE (a, c)
);
This specifies that the combination of values in the indicated columns is unique across the whole table, though any one of the columns need not be (and ordinarily isn't) unique.
You can assign your own name for a unique constraint, in the usual way:
CREATE TABLE products (
product_no integer CONSTRAINT must_be_different UNIQUE,
name text,
price numeric
);
Adding a unique constraint will automatically create a unique B-tree index on the column or group of columns listed in the constraint. A uniqueness restriction covering only some rows cannot be written as a unique constraint, but it is possible to enforce such a restriction by creating a unique partial index.
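For example, to require product numbers to be unique among currently active products only (a sketch assuming a hypothetical boolean column active), one could create a unique partial index instead of a constraint:
CREATE UNIQUE INDEX products_active_product_no_idx
    ON products (product_no)
    WHERE active;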
In general, a unique constraint is violated if there is more than one row in the table where the values of all of the columns included in the constraint are equal. However, two null values are never considered equal in this comparison. That means even in the presence of a unique constraint it is possible to store duplicate rows that contain a null value in at least one of the constrained columns. This behavior conforms to the SQL standard, but we have heard that other SQL databases might not follow this rule. So be careful when developing applications that are intended to be portable.
A primary key constraint indicates that a column, or group of columns, can be used as a unique identifier for rows in the table. This requires that the values be both unique and not null. So, the following two table definitions accept the same data:
CREATE TABLE products (
    product_no integer UNIQUE NOT NULL,
    name text,
    price numeric
);
CREATE TABLE products (
product_no integer PRIMARY KEY,
name text,
price numeric
);
Primary keys can span more than one column; the syntax is similar to unique constraints:
CREATE TABLE example (
a integer,
b integer,
c integer,
PRIMARY KEY (a, c)
);
Adding a primary key will automatically create a unique B-tree index
on the column or group of columns listed in the primary key, and will
force the column(s) to be marked NOT NULL
.
A table can have at most one primary key. (There can be any number of unique and not-null constraints, which are functionally almost the same thing, but only one can be identified as the primary key.) Relational database theory dictates that every table must have a primary key. This rule is not enforced by PostgreSQL, but it is usually best to follow it.
Primary keys are useful both for documentation purposes and for client applications. For example, a GUI application that allows modifying row values probably needs to know the primary key of a table to be able to identify rows uniquely. There are also various ways in which the database system makes use of a primary key if one has been declared; for example, the primary key defines the default target column(s) for foreign keys referencing its table.
A foreign key constraint specifies that the values in a column (or a group of columns) must match the values appearing in some row of another table. We say this maintains the referential integrity between two related tables.
Say you have the product table that we have used several times already:
CREATE TABLE products ( product_no integer PRIMARY KEY, name text, price numeric );
Let's also assume you have a table storing orders of those products. We want to ensure that the orders table only contains orders of products that actually exist. So we define a foreign key constraint in the orders table that references the products table:
CREATE TABLE orders (
order_id integer PRIMARY KEY,
product_no integer REFERENCES products (product_no),
quantity integer
);
Now it is impossible to create orders with non-NULL
product_no
entries that do not appear in the
products table.
We say that in this situation the orders table is the referencing table and the products table is the referenced table. Similarly, there are referencing and referenced columns.
You can also shorten the above command to:
CREATE TABLE orders (
order_id integer PRIMARY KEY,
product_no integer REFERENCES products,
quantity integer
);
because in absence of a column list the primary key of the referenced table is used as the referenced column(s).
You can assign your own name for a foreign key constraint, in the usual way.
A foreign key can also constrain and reference a group of columns. As usual, it then needs to be written in table constraint form. Here is a contrived syntax example:
CREATE TABLE t1 (
a integer PRIMARY KEY,
b integer,
c integer,
FOREIGN KEY (b, c) REFERENCES other_table (c1, c2)
);
Of course, the number and type of the constrained columns need to match the number and type of the referenced columns.
Sometimes it is useful for the “other table” of a foreign key constraint to be the same table; this is called a self-referential foreign key. For example, if you want rows of a table to represent nodes of a tree structure, you could write
CREATE TABLE tree ( node_id integer PRIMARY KEY, parent_id integer REFERENCES tree, name text, ... );
A top-level node would have NULL parent_id
,
while non-NULL parent_id
entries would be
constrained to reference valid rows of the table.
A table can have more than one foreign key constraint. This is used to implement many-to-many relationships between tables. Say you have tables about products and orders, but now you want to allow one order to contain possibly many products (which the structure above did not allow). You could use this table structure:
CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);

CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    shipping_address text,
    ...
);

CREATE TABLE order_items (
    product_no integer REFERENCES products,
    order_id integer REFERENCES orders,
    quantity integer,
    PRIMARY KEY (product_no, order_id)
);
Notice that the primary key overlaps with the foreign keys in the last table.
We know that the foreign keys disallow creation of orders that do not relate to any products. But what if a product is removed after an order is created that references it? SQL allows you to handle that as well. Intuitively, we have a few options:
Disallow deleting a referenced product
Delete the orders as well
Something else?
To illustrate this, let's implement the following policy on the
many-to-many relationship example above: when someone wants to
remove a product that is still referenced by an order (via
order_items
), we disallow it. If someone
removes an order, the order items are removed as well:
CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
);

CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    shipping_address text,
    ...
);

CREATE TABLE order_items (
    product_no integer REFERENCES products ON DELETE RESTRICT,
    order_id integer REFERENCES orders ON DELETE CASCADE,
    quantity integer,
    PRIMARY KEY (product_no, order_id)
);
Restricting and cascading deletes are the two most common options.
RESTRICT
prevents deletion of a
referenced row. NO ACTION
means that if any
referencing rows still exist when the constraint is checked, an error
is raised; this is the default behavior if you do not specify anything.
(The essential difference between these two choices is that
NO ACTION
allows the check to be deferred until
later in the transaction, whereas RESTRICT
does not.)
CASCADE
specifies that when a referenced row is deleted,
row(s) referencing it should be automatically deleted as well.
There are two other options:
SET NULL
and SET DEFAULT
.
These cause the referencing column(s) in the referencing row(s)
to be set to nulls or their default
values, respectively, when the referenced row is deleted.
Note that these do not excuse you from observing any constraints.
For example, if an action specifies SET DEFAULT
but the default value would not satisfy the foreign key constraint, the
operation will fail.
Analogous to ON DELETE
there is also
ON UPDATE
which is invoked when a referenced
column is changed (updated). The possible actions are the same.
In this case, CASCADE
means that the updated values of the
referenced column(s) should be copied into the referencing row(s).
Normally, a referencing row need not satisfy the foreign key constraint
if any of its referencing columns are null. If MATCH FULL
is added to the foreign key declaration, a referencing row escapes
satisfying the constraint only if all its referencing columns are null
(so a mix of null and non-null values is guaranteed to fail a
MATCH FULL
constraint). If you don't want referencing rows
to be able to avoid satisfying the foreign key constraint, declare the
referencing column(s) as NOT NULL
.
A foreign key must reference columns that either are a primary key or
form a unique constraint, or are columns from a non-partial unique index.
This means that the referenced columns always have an index to allow
efficient lookups on whether a referencing row has a match. Since a
DELETE
of a row from the referenced table or an
UPDATE
of a referenced column will require a scan of
the referencing table for rows matching the old value, it is often a good
idea to index the referencing columns too. Because this is not always
needed, and there are many choices available on how to index, the
declaration of a foreign key constraint does not automatically create an
index on the referencing columns.
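For instance, with the orders table defined earlier, deleting a product or changing its product number requires scanning orders for matching product_no values, so an index on the referencing column is often worthwhile:
CREATE INDEX ON orders (product_no);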
More information about updating and deleting data is in Chapter 6. Also see the description of foreign key constraint syntax in the reference documentation for CREATE TABLE.
Exclusion constraints ensure that if any two rows are compared on the specified columns or expressions using the specified operators, at least one of these operator comparisons will return false or null. The syntax is:
CREATE TABLE circles ( c circle, EXCLUDE USING gist (c WITH &&) );
See also CREATE
TABLE ... CONSTRAINT ... EXCLUDE
for details.
Adding an exclusion constraint will automatically create an index of the type specified in the constraint declaration.
Every table has several system columns that are implicitly defined by the system. Therefore, these names cannot be used as names of user-defined columns. (Note that these restrictions are separate from whether the name is a key word or not; quoting a name will not allow you to escape these restrictions.) You do not really need to be concerned about these columns; just know they exist.
tableoid
The OID of the table containing this row. This column is
particularly handy for queries that select from partitioned
tables (see Section 5.11) or inheritance
hierarchies (see Section 5.10), since without it,
it's difficult to tell which individual table a row came from. The
tableoid
can be joined against the
oid
column of
pg_class
to obtain the table name.
xmin
The identity (transaction ID) of the inserting transaction for this row version. (A row version is an individual state of a row; each update of a row creates a new row version for the same logical row.)
cmin
The command identifier (starting at zero) within the inserting transaction.
xmax
The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version. It is possible for this column to be nonzero in a visible row version. That usually indicates that the deleting transaction hasn't committed yet, or that an attempted deletion was rolled back.
cmax
The command identifier within the deleting transaction, or zero.
ctid
The physical location of the row version within its table. Note that
although the ctid
can be used to
locate the row version very quickly, a row's
ctid
will change if it is
updated or moved by VACUUM FULL
. Therefore
ctid
is useless as a long-term row
identifier. A primary key should be used to identify logical rows.
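As a quick illustration, system columns can simply be selected alongside a table's ordinary columns; a sketch using the products table from this chapter (the cast tableoid::regclass renders the table's name):
SELECT tableoid::regclass AS table_name, ctid, xmin, xmax, product_no, name
FROM products;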
Transaction identifiers are also 32-bit quantities. In a long-lived database it is possible for transaction IDs to wrap around. This is not a fatal problem given appropriate maintenance procedures; see Chapter 25 for details. It is unwise, however, to depend on the uniqueness of transaction IDs over the long term (more than one billion transactions).
Command identifiers are also 32-bit quantities. This creates a hard limit of 2^32 (4 billion) SQL commands within a single transaction. In practice this limit is not a problem — note that the limit is on the number of SQL commands, not the number of rows processed. Also, only commands that actually modify the database contents will consume a command identifier.
When you create a table and you realize that you made a mistake, or the requirements of the application change, you can drop the table and create it again. But this is not a convenient option if the table is already filled with data, or if the table is referenced by other database objects (for instance a foreign key constraint). Therefore PostgreSQL provides a family of commands to make modifications to existing tables. Note that this is conceptually distinct from altering the data contained in the table: here we are interested in altering the definition, or structure, of the table.
You can:
Add columns
Remove columns
Add constraints
Remove constraints
Change default values
Change column data types
Rename columns
Rename tables
All these actions are performed using the ALTER TABLE command, whose reference page contains details beyond those given here.
To add a column, use a command like:
ALTER TABLE products ADD COLUMN description text;
The new column is initially filled with whatever default
value is given (null if you don't specify a DEFAULT
clause).
From PostgreSQL 11, adding a column with
a constant default value no longer means that each row of the table
needs to be updated when the ALTER TABLE
statement
is executed. Instead, the default value will be returned the next time
the row is accessed, and applied when the table is rewritten, making
the ALTER TABLE
very fast even on large tables.
However, if the default value is volatile (e.g.,
clock_timestamp()
)
each row will need to be updated with the value calculated at the time
ALTER TABLE
is executed. To avoid a potentially
lengthy update operation, particularly if you intend to fill the column
with mostly nondefault values anyway, it may be preferable to add the
column with no default, insert the correct values using
UPDATE
, and then add any desired default as described
below.
You can also define constraints on the column at the same time, using the usual syntax:
ALTER TABLE products ADD COLUMN description text CHECK (description <> '');
In fact all the options that can be applied to a column description
in CREATE TABLE
can be used here. Keep in mind however
that the default value must satisfy the given constraints, or the
ADD
will fail. Alternatively, you can add
constraints later (see below) after you've filled in the new column
correctly.
To remove a column, use a command like:
ALTER TABLE products DROP COLUMN description;
Whatever data was in the column disappears. Table constraints involving
the column are dropped, too. However, if the column is referenced by a
foreign key constraint of another table,
PostgreSQL will not silently drop that
constraint. You can authorize dropping everything that depends on
the column by adding CASCADE
:
ALTER TABLE products DROP COLUMN description CASCADE;
See Section 5.14 for a description of the general mechanism behind this.
To add a constraint, the table constraint syntax is used. For example:
ALTER TABLE products ADD CHECK (name <> '');
ALTER TABLE products ADD CONSTRAINT some_name UNIQUE (product_no);
ALTER TABLE products ADD FOREIGN KEY (product_group_id) REFERENCES product_groups;
To add a not-null constraint, which cannot be written as a table constraint, use this syntax:
ALTER TABLE products ALTER COLUMN product_no SET NOT NULL;
The constraint will be checked immediately, so the table data must satisfy the constraint before it can be added.
To remove a constraint you need to know its name. If you gave it
a name then that's easy. Otherwise the system assigned a
generated name, which you need to find out. The psql command \d tablename can be helpful here; other interfaces might also provide a way to inspect table details. Then the command is:
ALTER TABLE products DROP CONSTRAINT some_name;
(If you are dealing with a generated constraint name like $2
,
don't forget that you'll need to double-quote it to make it a valid
identifier.)
As with dropping a column, you need to add CASCADE
if you
want to drop a constraint that something else depends on. An example
is that a foreign key constraint depends on a unique or primary key
constraint on the referenced column(s).
This works the same for all constraint types except not-null constraints. To drop a not null constraint use:
ALTER TABLE products ALTER COLUMN product_no DROP NOT NULL;
(Recall that not-null constraints do not have names.)
To set a new default for a column, use a command like:
ALTER TABLE products ALTER COLUMN price SET DEFAULT 7.77;
Note that this doesn't affect any existing rows in the table, it
just changes the default for future INSERT
commands.
To remove any default value, use:
ALTER TABLE products ALTER COLUMN price DROP DEFAULT;
This is effectively the same as setting the default to null. As a consequence, it is not an error to drop a default where one hadn't been defined, because the default is implicitly the null value.
To convert a column to a different data type, use a command like:
ALTER TABLE products ALTER COLUMN price TYPE numeric(10,2);
This will succeed only if each existing entry in the column can be
converted to the new type by an implicit cast. If a more complex
conversion is needed, you can add a USING
clause that
specifies how to compute the new values from the old.
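For instance, a hedged sketch of a conversion that an implicit cast cannot do (the conversion shown, to an integer number of cents, is purely illustrative) might be:
ALTER TABLE products ALTER COLUMN price TYPE integer USING (price * 100)::integer;  -- hypothetical: store the price as cents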
PostgreSQL will attempt to convert the column's default value (if any) to the new type, as well as any constraints that involve the column. But these conversions might fail, or might produce surprising results. It's often best to drop any constraints on the column before altering its type, and then add back suitably modified constraints afterwards.
When an object is created, it is assigned an owner. The owner is normally the role that executed the creation statement. For most kinds of objects, the initial state is that only the owner (or a superuser) can do anything with the object. To allow other roles to use it, privileges must be granted.
There are different kinds of privileges: SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, CONNECT, TEMPORARY, EXECUTE, and USAGE.
The privileges applicable to a particular
object vary depending on the object's type (table, function, etc).
More detail about the meanings of these privileges appears below.
The following sections and chapters will also show you how
these privileges are used.
The right to modify or destroy an object is inherent in being the object's owner, and cannot be granted or revoked in itself. (However, like all privileges, that right can be inherited by members of the owning role; see Section 22.3.)
An object can be assigned to a new owner with an ALTER
command of the appropriate kind for the object, for example
ALTER TABLE table_name OWNER TO new_owner;
Superusers can always do this; ordinary roles can only do it if they are both the current owner of the object (or a member of the owning role) and a member of the new owning role.
To assign privileges, the GRANT command is
used. For example, if joe
is an existing role, and
accounts
is an existing table, the privilege to
update the table can be granted with:
GRANT UPDATE ON accounts TO joe;
Writing ALL
in place of a specific privilege grants all
privileges that are relevant for the object type.
The special “role” name PUBLIC
can
be used to grant a privilege to every role on the system. Also,
“group” roles can be set up to help manage privileges when
there are many users of a database — for details see
Chapter 22.
To revoke a previously-granted privilege, use the fittingly named REVOKE command:
REVOKE ALL ON accounts FROM PUBLIC;
Ordinarily, only the object's owner (or a superuser) can grant or revoke privileges on an object. However, it is possible to grant a privilege “with grant option”, which gives the recipient the right to grant it in turn to others. If the grant option is subsequently revoked then all who received the privilege from that recipient (directly or through a chain of grants) will lose the privilege. For details see the GRANT and REVOKE reference pages.
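Continuing the joe/accounts example, a grant with grant option and the later revocation of just that option might look like this (CASCADE is needed if joe has in turn granted the privilege to others):
GRANT UPDATE ON accounts TO joe WITH GRANT OPTION;
REVOKE GRANT OPTION FOR UPDATE ON accounts FROM joe CASCADE;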
An object's owner can choose to revoke their own ordinary privileges, for example to make a table read-only for themselves as well as others. But owners are always treated as holding all grant options, so they can always re-grant their own privileges.
The available privileges are:
SELECT
Allows SELECT
from
any column, or specific column(s), of a table, view, materialized
view, or other table-like object.
Also allows use of COPY TO
.
This privilege is also needed to reference existing column values in
UPDATE
or DELETE
.
For sequences, this privilege also allows use of the
currval
function.
For large objects, this privilege allows the object to be read.
INSERT
Allows INSERT
of a new row into a table, view,
etc. Can be granted on specific column(s), in which case
only those columns may be assigned to in the INSERT
command (other columns will therefore receive default values).
Also allows use of COPY FROM
.
UPDATE
Allows UPDATE
of any
column, or specific column(s), of a table, view, etc.
(In practice, any nontrivial UPDATE
command will
require SELECT
privilege as well, since it must
reference table columns to determine which rows to update, and/or to
compute new values for columns.)
SELECT ... FOR UPDATE
and SELECT ... FOR SHARE
also require this privilege on at least one column, in addition to the
SELECT
privilege. For sequences, this
privilege allows use of the nextval
and
setval
functions.
For large objects, this privilege allows writing or truncating the
object.
DELETE
Allows DELETE
of a row from a table, view, etc.
(In practice, any nontrivial DELETE
command will
require SELECT
privilege as well, since it must
reference table columns to determine which rows to delete.)
TRUNCATE
Allows TRUNCATE
on a table.
REFERENCES
Allows creation of a foreign key constraint referencing a table, or specific column(s) of a table.
TRIGGER
Allows creation of a trigger on a table, view, etc.
CREATE
For databases, allows new schemas and publications to be created within the database, and allows trusted extensions to be installed within the database.
For schemas, allows new objects to be created within the schema. To rename an existing object, you must own the object and have this privilege for the containing schema.
For tablespaces, allows tables, indexes, and temporary files to be created within the tablespace, and allows databases to be created that have the tablespace as their default tablespace.
Note that revoking this privilege will not alter the existence or location of existing objects.
CONNECT
Allows the grantee to connect to the database. This
privilege is checked at connection startup (in addition to checking
any restrictions imposed by pg_hba.conf
).
TEMPORARY
Allows temporary tables to be created while using the database.
EXECUTE
Allows calling a function or procedure, including use of any operators that are implemented on top of the function. This is the only type of privilege that is applicable to functions and procedures.
USAGE
For procedural languages, allows use of the language for the creation of functions in that language. This is the only type of privilege that is applicable to procedural languages.
For schemas, allows access to objects contained in the schema (assuming that the objects' own privilege requirements are also met). Essentially this allows the grantee to “look up” objects within the schema. Without this permission, it is still possible to see the object names, e.g., by querying system catalogs. Also, after revoking this permission, existing sessions might have statements that have previously performed this lookup, so this is not a completely secure way to prevent object access.
For sequences, allows use of the
currval
and nextval
functions.
For types and domains, allows use of the type or domain in the creation of tables, functions, and other schema objects. (Note that this privilege does not control all “usage” of the type, such as values of the type appearing in queries. It only prevents objects from being created that depend on the type. The main purpose of this privilege is controlling which users can create dependencies on a type, which could prevent the owner from changing the type later.)
For foreign-data wrappers, allows creation of new servers using the foreign-data wrapper.
For foreign servers, allows creation of foreign tables using the server. Grantees may also create, alter, or drop their own user mappings associated with that server.
The privileges required by other commands are listed on the reference page of the respective command.
PostgreSQL grants privileges on some types of objects to
PUBLIC
by default when the objects are created.
No privileges are granted to PUBLIC
by default on
tables,
table columns,
sequences,
foreign data wrappers,
foreign servers,
large objects,
schemas,
or tablespaces.
For other types of objects, the default privileges
granted to PUBLIC
are as follows:
CONNECT
and TEMPORARY
(create
temporary tables) privileges for databases;
EXECUTE
privilege for functions and procedures; and
USAGE
privilege for languages and data types
(including domains).
The object owner can, of course, REVOKE
both default and expressly granted privileges. (For maximum
security, issue the REVOKE
in the same transaction that
creates the object; then there is no window in which another user
can use the object.)
Also, these default privilege settings can be overridden using the
ALTER DEFAULT PRIVILEGES command.
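For example, a role that prefers its future functions not to be executable by PUBLIC by default could issue:
ALTER DEFAULT PRIVILEGES REVOKE EXECUTE ON FUNCTIONS FROM PUBLIC;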
Table 5.1 shows the one-letter abbreviations that are used for these privilege types in ACL (Access Control List) values. You will see these letters in the output of the psql commands listed below, or when looking at ACL columns of system catalogs.
Table 5.1. ACL Privilege Abbreviations
Privilege | Abbreviation | Applicable Object Types |
---|---|---|
SELECT | r (“read”) | LARGE OBJECT, SEQUENCE, TABLE (and table-like objects), table column |
INSERT | a (“append”) | TABLE, table column |
UPDATE | w (“write”) | LARGE OBJECT, SEQUENCE, TABLE, table column |
DELETE | d | TABLE |
TRUNCATE | D | TABLE |
REFERENCES | x | TABLE, table column |
TRIGGER | t | TABLE |
CREATE | C | DATABASE, SCHEMA, TABLESPACE |
CONNECT | c | DATABASE |
TEMPORARY | T | DATABASE |
EXECUTE | X | FUNCTION, PROCEDURE |
USAGE | U | DOMAIN, FOREIGN DATA WRAPPER, FOREIGN SERVER, LANGUAGE, SCHEMA, SEQUENCE, TYPE |
Table 5.2 summarizes the privileges available for each type of SQL object, using the abbreviations shown above. It also shows the psql command that can be used to examine privilege settings for each object type.
Table 5.2. Summary of Access Privileges
Object Type | All Privileges | Default PUBLIC Privileges | psql Command |
---|---|---|---|
DATABASE | CTc | Tc | \l |
DOMAIN | U | U | \dD+ |
FUNCTION or PROCEDURE | X | X | \df+ |
FOREIGN DATA WRAPPER | U | none | \dew+ |
FOREIGN SERVER | U | none | \des+ |
LANGUAGE | U | U | \dL+ |
LARGE OBJECT | rw | none | |
SCHEMA | UC | none | \dn+ |
SEQUENCE | rwU | none | \dp |
TABLE (and table-like objects) | arwdDxt | none | \dp |
Table column | arwx | none | \dp |
TABLESPACE | C | none | \db+ |
TYPE | U | U | \dT+ |
The privileges that have been granted for a particular object are
displayed as a list of aclitem
entries, each having the
format:
grantee=privilege-abbreviation[*].../grantor
Each aclitem
lists all the permissions of one grantee that
have been granted by a particular grantor. Specific privileges are
represented by one-letter abbreviations from
Table 5.1, with *
appended if the privilege was granted with grant option. For example,
calvin=r*w/hobbes
specifies that the role
calvin
has the privilege
SELECT
(r
) with grant option
(*
) as well as the non-grantable
privilege UPDATE
(w
), both granted
by the role hobbes
. If calvin
also has some privileges on the same object granted by a different
grantor, those would appear as a separate aclitem
entry.
An empty grantee field in an aclitem
stands
for PUBLIC
.
As an example, suppose that user miriam
creates
table mytable
and does:
GRANT SELECT ON mytable TO PUBLIC; GRANT SELECT, UPDATE, INSERT ON mytable TO admin; GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw;
Then psql's \dp
command
would show:
=> \dp mytable Access privileges Schema | Name | Type | Access privileges | Column privileges | Policies --------+---------+-------+-----------------------+-----------------------+---------- public | mytable | table | miriam=arwdDxt/miriam+| col1: +| | | | =r/miriam +| miriam_rw=rw/miriam | | | | admin=arw/miriam | | (1 row)
If the “Access privileges” column is empty for a given
object, it means the object has default privileges (that is, its
privileges entry in the relevant system catalog is null). Default
privileges always include all privileges for the owner, and can include
some privileges for PUBLIC
depending on the object
type, as explained above. The first GRANT
or REVOKE
on an object will instantiate the default
privileges (producing, for
example, miriam=arwdDxt/miriam
) and then modify them
per the specified request. Similarly, entries are shown in “Column
privileges” only for columns with nondefault privileges.
(Note: for this purpose, “default privileges” always means
the built-in default privileges for the object's type. An object whose
privileges have been affected by an ALTER DEFAULT
PRIVILEGES
command will always be shown with an explicit
privilege entry that includes the effects of
the ALTER
.)
Notice that the owner's implicit grant options are not marked in the
access privileges display. A *
will appear only when
grant options have been explicitly granted to someone.
In addition to the SQL-standard privilege system available through GRANT, tables can have row security policies that restrict, on a per-user basis, which rows can be returned by normal queries or inserted, updated, or deleted by data modification commands. This feature is also known as Row-Level Security. By default, tables do not have any policies, so that if a user has access privileges to a table according to the SQL privilege system, all rows within it are equally available for querying or updating.
When row security is enabled on a table (with
ALTER TABLE ... ENABLE ROW LEVEL
SECURITY), all normal access to the table for selecting rows or
modifying rows must be allowed by a row security policy. (However, the
table's owner is typically not subject to row security policies.) If no
policy exists for the table, a default-deny policy is used, meaning that
no rows are visible or can be modified. Operations that apply to the
whole table, such as TRUNCATE
and REFERENCES
,
are not subject to row security.
Row security policies can be specific to commands, or to roles, or to
both. A policy can be specified to apply to ALL
commands, or to SELECT
, INSERT
, UPDATE
,
or DELETE
. Multiple roles can be assigned to a given
policy, and normal role membership and inheritance rules apply.
To specify which rows are visible or modifiable according to a policy,
an expression is required that returns a Boolean result. This
expression will be evaluated for each row prior to any conditions or
functions coming from the user's query. (The only exceptions to this
rule are leakproof
functions, which are guaranteed to
not leak information; the optimizer may choose to apply such functions
ahead of the row-security check.) Rows for which the expression does
not return true
will not be processed. Separate expressions
may be specified to provide independent control over the rows which are
visible and the rows which are allowed to be modified. Policy
expressions are run as part of the query and with the privileges of the
user running the query, although security-definer functions can be used
to access data not available to the calling user.
Superusers and roles with the BYPASSRLS
attribute always
bypass the row security system when accessing a table. Table owners
normally bypass row security as well, though a table owner can choose to
be subject to row security with ALTER
TABLE ... FORCE ROW LEVEL SECURITY.
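As a hedged illustration (the table name here is hypothetical), an owner who wants their own queries to be checked against the table's policies could run:
ALTER TABLE payroll FORCE ROW LEVEL SECURITY;  -- payroll is a made-up table name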
Enabling and disabling row security, as well as adding policies to a table, is always the privilege of the table owner only.
Policies are created using the CREATE POLICY command, altered using the ALTER POLICY command, and dropped using the DROP POLICY command. To enable and disable row security for a given table, use the ALTER TABLE command.
Each policy has a name and multiple policies can be defined for a table. As policies are table-specific, each policy for a table must have a unique name. Different tables may have policies with the same name.
When multiple policies apply to a given query, they are combined using
either OR
(for permissive policies, which are the
default) or using AND
(for restrictive policies).
This is similar to the rule that a given role has the privileges
of all roles that they are a member of. Permissive vs. restrictive
policies are discussed further below.
As a simple example, here is how to create a policy on
the account
relation to allow only members of
the managers
role to access rows, and only rows of their
accounts:
CREATE TABLE accounts (manager text, company text, contact_email text); ALTER TABLE accounts ENABLE ROW LEVEL SECURITY; CREATE POLICY account_managers ON accounts TO managers USING (manager = current_user);
The policy above implicitly provides a WITH CHECK
clause identical to its USING
clause, so that the
constraint applies both to rows selected by a command (so a manager
cannot SELECT
, UPDATE
,
or DELETE
existing rows belonging to a different
manager) and to rows modified by a command (so rows belonging to a
different manager cannot be created via INSERT
or UPDATE
).
If no role is specified, or the special user name
PUBLIC
is used, then the policy applies to all
users on the system. To allow all users to access only their own row in
a users
table, a simple policy can be used:
CREATE POLICY user_policy ON users USING (user_name = current_user);
This works similarly to the previous example.
To use a different policy for rows that are being added to the table
compared to those rows that are visible, multiple policies can be
combined. This pair of policies would allow all users to view all rows
in the users
table, but only modify their own:
CREATE POLICY user_sel_policy ON users FOR SELECT USING (true); CREATE POLICY user_mod_policy ON users USING (user_name = current_user);
In a SELECT
command, these two policies are combined
using OR
, with the net effect being that all rows
can be selected. In other command types, only the second policy applies,
so that the effects are the same as before.
Row security can also be disabled with the ALTER TABLE
command. Disabling row security does not remove any policies that are
defined on the table; they are simply ignored. Then all rows in the
table are visible and modifiable, subject to the standard SQL privileges
system.
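For example, to turn row security back off for the users table from the previous example, while leaving its policies in place:
ALTER TABLE users DISABLE ROW LEVEL SECURITY;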
Below is a larger example of how this feature can be used in production
environments. The table passwd
emulates a Unix password
file:
-- Simple passwd-file based example CREATE TABLE passwd ( user_name text UNIQUE NOT NULL, pwhash text, uid int PRIMARY KEY, gid int NOT NULL, real_name text NOT NULL, home_phone text, extra_info text, home_dir text NOT NULL, shell text NOT NULL ); CREATE ROLE admin; -- Administrator CREATE ROLE bob; -- Normal user CREATE ROLE alice; -- Normal user -- Populate the table INSERT INTO passwd VALUES ('admin','xxx',0,0,'Admin','111-222-3333',null,'/root','/bin/dash'); INSERT INTO passwd VALUES ('bob','xxx',1,1,'Bob','123-456-7890',null,'/home/bob','/bin/zsh'); INSERT INTO passwd VALUES ('alice','xxx',2,1,'Alice','098-765-4321',null,'/home/alice','/bin/zsh'); -- Be sure to enable row-level security on the table ALTER TABLE passwd ENABLE ROW LEVEL SECURITY; -- Create policies -- Administrator can see all rows and add any rows CREATE POLICY admin_all ON passwd TO admin USING (true) WITH CHECK (true); -- Normal users can view all rows CREATE POLICY all_view ON passwd FOR SELECT USING (true); -- Normal users can update their own records, but -- limit which shells a normal user is allowed to set CREATE POLICY user_mod ON passwd FOR UPDATE USING (current_user = user_name) WITH CHECK ( current_user = user_name AND shell IN ('/bin/bash','/bin/sh','/bin/dash','/bin/zsh','/bin/tcsh') ); -- Allow admin all normal rights GRANT SELECT, INSERT, UPDATE, DELETE ON passwd TO admin; -- Users only get select access on public columns GRANT SELECT (user_name, uid, gid, real_name, home_phone, extra_info, home_dir, shell) ON passwd TO public; -- Allow users to update certain columns GRANT UPDATE (pwhash, real_name, home_phone, extra_info, shell) ON passwd TO public;
As with any security settings, it's important to test and ensure that the system is behaving as expected. Using the example above, the following session demonstrates that the permission system is working properly.
-- admin can view all rows and fields postgres=> set role admin; SET postgres=> table passwd; user_name | pwhash | uid | gid | real_name | home_phone | extra_info | home_dir | shell -----------+--------+-----+-----+-----------+--------------+------------+-------------+----------- admin | xxx | 0 | 0 | Admin | 111-222-3333 | | /root | /bin/dash bob | xxx | 1 | 1 | Bob | 123-456-7890 | | /home/bob | /bin/zsh alice | xxx | 2 | 1 | Alice | 098-765-4321 | | /home/alice | /bin/zsh (3 rows) -- Test what Alice is able to do postgres=> set role alice; SET postgres=> table passwd; ERROR: permission denied for table passwd postgres=> select user_name,real_name,home_phone,extra_info,home_dir,shell from passwd; user_name | real_name | home_phone | extra_info | home_dir | shell -----------+-----------+--------------+------------+-------------+----------- admin | Admin | 111-222-3333 | | /root | /bin/dash bob | Bob | 123-456-7890 | | /home/bob | /bin/zsh alice | Alice | 098-765-4321 | | /home/alice | /bin/zsh (3 rows) postgres=> update passwd set user_name = 'joe'; ERROR: permission denied for table passwd -- Alice is allowed to change her own real_name, but no others postgres=> update passwd set real_name = 'Alice Doe'; UPDATE 1 postgres=> update passwd set real_name = 'John Doe' where user_name = 'admin'; UPDATE 0 postgres=> update passwd set shell = '/bin/xx'; ERROR: new row violates WITH CHECK OPTION for "passwd" postgres=> delete from passwd; ERROR: permission denied for table passwd postgres=> insert into passwd (user_name) values ('xxx'); ERROR: permission denied for table passwd -- Alice can change her own password; RLS silently prevents updating other rows postgres=> update passwd set pwhash = 'abc'; UPDATE 1
All of the policies constructed thus far have been permissive policies,
meaning that when multiple policies are applied they are combined using
the “OR” Boolean operator. While permissive policies can be constructed
to only allow access to rows in the intended cases, it can be simpler to
combine permissive policies with restrictive policies (which the records
must pass and which are combined using the “AND” Boolean operator).
Building on the example above, we add a restrictive policy to require
the administrator to be connected over a local Unix socket to access the
records of the passwd
table:
CREATE POLICY admin_local_only ON passwd AS RESTRICTIVE TO admin USING (pg_catalog.inet_client_addr() IS NULL);
We can then see that an administrator connecting over a network will not see any records, due to the restrictive policy:
=> SELECT current_user; current_user -------------- admin (1 row) => select inet_client_addr(); inet_client_addr ------------------ 127.0.0.1 (1 row) => TABLE passwd; user_name | pwhash | uid | gid | real_name | home_phone | extra_info | home_dir | shell -----------+--------+-----+-----+-----------+------------+------------+----------+------- (0 rows) => UPDATE passwd set pwhash = NULL; UPDATE 0
Referential integrity checks, such as unique or primary key constraints and foreign key references, always bypass row security to ensure that data integrity is maintained. Care must be taken when developing schemas and row level policies to avoid “covert channel” leaks of information through such referential integrity checks.
In some contexts it is important to be sure that row security is
not being applied. For example, when taking a backup, it could be
disastrous if row security silently caused some rows to be omitted
from the backup. In such a situation, you can set the
row_security configuration parameter
to off
. This does not in itself bypass row security;
what it does is throw an error if any query's results would get filtered
by a policy. The reason for the error can then be investigated and
fixed.
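For example, a session that must not have rows silently filtered (such as one taking a backup) could begin with:
SET row_security = off;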
In the examples above, the policy expressions consider only the current
values in the row to be accessed or updated. This is the simplest and
best-performing case; when possible, it's best to design row security
applications to work this way. If it is necessary to consult other rows
or other tables to make a policy decision, that can be accomplished using
sub-SELECT
s, or functions that contain SELECT
s,
in the policy expressions. Be aware however that such accesses can
create race conditions that could allow information leakage if care is
not taken. As an example, consider the following table design:
-- definition of privilege groups CREATE TABLE groups (group_id int PRIMARY KEY, group_name text NOT NULL); INSERT INTO groups VALUES (1, 'low'), (2, 'medium'), (5, 'high'); GRANT ALL ON groups TO alice; -- alice is the administrator GRANT SELECT ON groups TO public; -- definition of users' privilege levels CREATE TABLE users (user_name text PRIMARY KEY, group_id int NOT NULL REFERENCES groups); INSERT INTO users VALUES ('alice', 5), ('bob', 2), ('mallory', 2); GRANT ALL ON users TO alice; GRANT SELECT ON users TO public; -- table holding the information to be protected CREATE TABLE information (info text, group_id int NOT NULL REFERENCES groups); INSERT INTO information VALUES ('barely secret', 1), ('slightly secret', 2), ('very secret', 5); ALTER TABLE information ENABLE ROW LEVEL SECURITY; -- a row should be visible to/updatable by users whose security group_id is -- greater than or equal to the row's group_id CREATE POLICY fp_s ON information FOR SELECT USING (group_id <= (SELECT group_id FROM users WHERE user_name = current_user)); CREATE POLICY fp_u ON information FOR UPDATE USING (group_id <= (SELECT group_id FROM users WHERE user_name = current_user)); -- we rely only on RLS to protect the information table GRANT ALL ON information TO public;
Now suppose that alice
wishes to change the “slightly
secret” information, but decides that mallory
should not
be trusted with the new content of that row, so she does:
BEGIN; UPDATE users SET group_id = 1 WHERE user_name = 'mallory'; UPDATE information SET info = 'secret from mallory' WHERE group_id = 2; COMMIT;
That looks safe; there is no window wherein mallory
should be
able to see the “secret from mallory” string. However, there is
a race condition here. If mallory
is concurrently doing,
say,
SELECT * FROM information WHERE group_id = 2 FOR UPDATE;
and her transaction is in READ COMMITTED
mode, it is possible
for her to see “secret from mallory”. That happens if her
transaction reaches the information
row just
after alice
's does. It blocks waiting
for alice
's transaction to commit, then fetches the updated
row contents thanks to the FOR UPDATE
clause. However, it
does not fetch an updated row for the
implicit SELECT
from users
, because that
sub-SELECT
did not have FOR UPDATE
; instead
the users
row is read with the snapshot taken at the start
of the query. Therefore, the policy expression tests the old value
of mallory
's privilege level and allows her to see the
updated row.
There are several ways around this problem. One simple answer is to use
SELECT ... FOR SHARE
in sub-SELECT
s in row
security policies. However, that requires granting UPDATE
privilege on the referenced table (here users
) to the
affected users, which might be undesirable. (But another row security
policy could be applied to prevent them from actually exercising that
privilege; or the sub-SELECT
could be embedded into a security
definer function.) Also, heavy concurrent use of row share locks on the
referenced table could pose a performance problem, especially if updates
of it are frequent. Another solution, practical if updates of the
referenced table are infrequent, is to take an
ACCESS EXCLUSIVE
lock on the
referenced table when updating it, so that no concurrent transactions
could be examining old row values. Or one could just wait for all
concurrent transactions to end after committing an update of the
referenced table and before making changes that rely on the new security
situation.
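As a minimal sketch of the first approach (assuming, per the discussion above, that the affected users have also been granted UPDATE privilege on users so that they may lock its rows), the SELECT policy could be redefined as:
ALTER POLICY fp_s ON information USING (group_id <= (SELECT group_id FROM users WHERE user_name = current_user FOR SHARE));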
For additional details see CREATE POLICY and ALTER TABLE.
A PostgreSQL database cluster contains one or more named databases. Roles and a few other object types are shared across the entire cluster. A client connection to the server can only access data in a single database, the one specified in the connection request.
Users of a cluster do not necessarily have the privilege to access every
database in the cluster. Sharing of role names means that there
cannot be different roles named, say, joe
in two databases
in the same cluster; but the system can be configured to allow
joe
access to only some of the databases.
A database contains one or more named schemas, which
in turn contain tables. Schemas also contain other kinds of named
objects, including data types, functions, and operators. The same
object name can be used in different schemas without conflict; for
example, both schema1
and myschema
can
contain tables named mytable
. Unlike databases,
schemas are not rigidly separated: a user can access objects in any
of the schemas in the database they are connected to, if they have
privileges to do so.
There are several reasons why one might want to use schemas:
To allow many users to use one database without interfering with each other.
To organize database objects into logical groups to make them more manageable.
Third-party applications can be put into separate schemas so they do not collide with the names of other objects.
Schemas are analogous to directories at the operating system level, except that schemas cannot be nested.
To create a schema, use the CREATE SCHEMA command. Give the schema a name of your choice. For example:
CREATE SCHEMA myschema;
To create or access objects in a schema, write a qualified name consisting of the schema name and table name separated by a dot:
schema.table
This works anywhere a table name is expected, including the table modification commands and the data access commands discussed in the following chapters. (For brevity we will speak of tables only, but the same ideas apply to other kinds of named objects, such as types and functions.)
Actually, the even more general syntax
database.schema.table
can be used too, but at present this is just for pro forma compliance with the SQL standard. If you write a database name, it must be the same as the database you are connected to.
So to create a table in the new schema, use:
CREATE TABLE myschema.mytable ( ... );
To drop a schema if it's empty (all objects in it have been dropped), use:
DROP SCHEMA myschema;
To drop a schema including all contained objects, use:
DROP SCHEMA myschema CASCADE;
See Section 5.14 for a description of the general mechanism behind this.
Often you will want to create a schema owned by someone else (since this is one of the ways to restrict the activities of your users to well-defined namespaces). The syntax for that is:
CREATE SCHEMA schema_name AUTHORIZATION user_name;
You can even omit the schema name, in which case the schema name will be the same as the user name. See Section 5.9.6 for how this can be useful.
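For example, if joe is an existing role, the following creates a schema named joe that is owned by joe:
CREATE SCHEMA AUTHORIZATION joe;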
Schema names beginning with pg_
are reserved for
system purposes and cannot be created by users.
In the previous sections we created tables without specifying any schema names. By default such tables (and other objects) are automatically put into a schema named “public”. Every new database contains such a schema. Thus, the following are equivalent:
CREATE TABLE products ( ... );
and:
CREATE TABLE public.products ( ... );
Qualified names are tedious to write, and it's often best not to wire a particular schema name into applications anyway. Therefore tables are often referred to by unqualified names, which consist of just the table name. The system determines which table is meant by following a search path, which is a list of schemas to look in. The first matching table in the search path is taken to be the one wanted. If there is no match in the search path, an error is reported, even if matching table names exist in other schemas in the database.
The ability to create like-named objects in different schemas complicates
writing a query that references precisely the same objects every time. It
also opens up the potential for users to change the behavior of other
users' queries, maliciously or accidentally. Due to the prevalence of
unqualified names in queries and their use
in PostgreSQL internals, adding a schema
to search_path
effectively trusts all users having
CREATE
privilege on that schema. When you run an
ordinary query, a malicious user able to create objects in a schema of
your search path can take control and execute arbitrary SQL functions as
though you executed them.
The first schema named in the search path is called the current schema.
Aside from being the first schema searched, it is also the schema in
which new tables will be created if the CREATE TABLE
command does not specify a schema name.
To show the current search path, use the following command:
SHOW search_path;
In the default setup this returns:
search_path -------------- "$user", public
The first element specifies that a schema with the same name as the current user is to be searched. If no such schema exists, the entry is ignored. The second element refers to the public schema that we have seen already.
The first schema in the search path that exists is the default location for creating new objects. That is the reason that by default objects are created in the public schema. When objects are referenced in any other context without schema qualification (table modification, data modification, or query commands) the search path is traversed until a matching object is found. Therefore, in the default configuration, any unqualified access again can only refer to the public schema.
To put our new schema in the path, we use:
SET search_path TO myschema,public;
(We omit the $user
here because we have no
immediate need for it.) And then we can access the table without
schema qualification:
DROP TABLE mytable;
Also, since myschema
is the first element in
the path, new objects would by default be created in it.
We could also have written:
SET search_path TO myschema;
Then we no longer have access to the public schema without explicit qualification. There is nothing special about the public schema except that it exists by default. It can be dropped, too.
See also Section 9.26 for other ways to manipulate the schema search path.
The search path works in the same way for data type names, function names, and operator names as it does for table names. Data type and function names can be qualified in exactly the same way as table names. If you need to write a qualified operator name in an expression, there is a special provision: you must write
OPERATOR(schema.operator)
This is needed to avoid syntactic ambiguity. An example is:
SELECT 3 OPERATOR(pg_catalog.+) 4;
In practice one usually relies on the search path for operators, so as not to have to write anything so ugly as that.
By default, users cannot access any objects in schemas they do not
own. To allow that, the owner of the schema must grant the
USAGE
privilege on the schema. To allow users
to make use of the objects in the schema, additional privileges
might need to be granted, as appropriate for the object.
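For example, a sketch of giving another role read access to the current contents of a schema might look like this (myschema and joe are the names used elsewhere in this chapter; note that GRANT ... ON ALL TABLES affects only tables that already exist):
GRANT USAGE ON SCHEMA myschema TO joe;
GRANT SELECT ON ALL TABLES IN SCHEMA myschema TO joe;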
A user can also be allowed to create objects in someone else's
schema. To allow that, the CREATE
privilege on
the schema needs to be granted. Note that by default, everyone
has CREATE
and USAGE
privileges on
the schema
public
. This allows all users that are able to
connect to a given database to create objects in its
public
schema.
Some usage patterns call for
revoking that privilege:
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
(The first “public” is the schema, the second “public” means “every user”. In the first sense it is an identifier, in the second sense it is a key word, hence the different capitalization; recall the guidelines from Section 4.1.1.)
In addition to public
and user-created schemas, each
database contains a pg_catalog
schema, which contains
the system tables and all the built-in data types, functions, and
operators. pg_catalog
is always effectively part of
the search path. If it is not named explicitly in the path then
it is implicitly searched before searching the path's
schemas. This ensures that built-in names will always be
findable. However, you can explicitly place
pg_catalog
at the end of your search path if you
prefer to have user-defined names override built-in names.
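For example, to have user-defined names in myschema and public take precedence over built-in names:
SET search_path TO myschema, public, pg_catalog;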
Since system table names begin with pg_
, it is best to
avoid such names to ensure that you won't suffer a conflict if some
future version defines a system table named the same as your
table. (With the default search path, an unqualified reference to
your table name would then be resolved as the system table instead.)
System tables will continue to follow the convention of having
names beginning with pg_
, so that they will not
conflict with unqualified user-table names so long as users avoid
the pg_
prefix.
Schemas can be used to organize your data in many ways.
A secure schema usage pattern prevents untrusted
users from changing the behavior of other users' queries. When a database
does not use a secure schema usage pattern, users wishing to securely
query that database would take protective action at the beginning of each
session. Specifically, they would begin each session by
setting search_path
to the empty string or otherwise
removing non-superuser-writable schemas
from search_path
. There are a few usage patterns
easily supported by the default configuration:
Constrain ordinary users to user-private schemas. To implement this,
issue REVOKE CREATE ON SCHEMA public FROM PUBLIC
,
and create a schema for each user with the same name as that user.
Recall that the default search path starts
with $user
, which resolves to the user name.
Therefore, if each user has a separate schema, they access their own
schemas by default. After adopting this pattern in a database where
untrusted users had already logged in, consider auditing the public
schema for objects named like objects in
schema pg_catalog
. This pattern is a secure schema
usage pattern unless an untrusted user is the database owner or holds
the CREATEROLE
privilege, in which case no secure
schema usage pattern exists.
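A minimal sketch of setting up this pattern for two hypothetical users, alice and bob, might be:
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
CREATE SCHEMA alice AUTHORIZATION alice;  -- alice and bob are made-up role names
CREATE SCHEMA bob AUTHORIZATION bob;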
Remove the public schema from the default search path, by modifying
postgresql.conf
or by issuing ALTER ROLE ALL SET search_path =
"$user"
. Everyone retains the ability to create objects in
the public schema, but only qualified names will choose those objects.
While qualified table references are fine, calls to functions in the
public schema will be unsafe or
unreliable. If you create functions or extensions in the public
schema, use the first pattern instead. Otherwise, like the first
pattern, this is secure unless an untrusted user is the database owner
or holds the CREATEROLE
privilege.
Keep the default. All users access the public schema implicitly. This simulates the situation where schemas are not available at all, giving a smooth transition from the non-schema-aware world. However, this is never a secure pattern. It is acceptable only when the database has a single user or a few mutually-trusting users.
For any pattern, to install shared applications (tables to be used by everyone, additional functions provided by third parties, etc.), put them into separate schemas. Remember to grant appropriate privileges to allow the other users to access them. Users can then refer to these additional objects by qualifying the names with a schema name, or they can put the additional schemas into their search path, as they choose.
In the SQL standard, the notion of objects in the same schema
being owned by different users does not exist. Moreover, some
implementations do not allow you to create schemas that have a
different name than their owner. In fact, the concepts of schema
and user are nearly equivalent in a database system that
implements only the basic schema support specified in the
standard. Therefore, many users consider qualified names to really consist of user_name.table_name. This is how PostgreSQL will effectively behave if you create a per-user schema for every user.
Also, there is no concept of a public
schema in the
SQL standard. For maximum conformance to the standard, you should
not use the public
schema.
Of course, some SQL database systems might not implement schemas at all, or provide namespace support by allowing (possibly limited) cross-database access. If you need to work with those systems, then maximum portability would be achieved by not using schemas at all.
PostgreSQL implements table inheritance, which can be a useful tool for database designers. (SQL:1999 and later define a type inheritance feature, which differs in many respects from the features described here.)
Let's start with an example: suppose we are trying to build a data
model for cities. Each state has many cities, but only one
capital. We want to be able to quickly retrieve the capital city
for any particular state. This can be done by creating two tables,
one for state capitals and one for cities that are not
capitals. However, what happens when we want to ask for data about
a city, regardless of whether it is a capital or not? The
inheritance feature can help to resolve this problem. We define the
capitals
table so that it inherits from
cities
:
CREATE TABLE cities ( name text, population float, elevation int -- in feet ); CREATE TABLE capitals ( state char(2) ) INHERITS (cities);
In this case, the capitals
table inherits
all the columns of its parent table, cities
. State
capitals also have an extra column, state
, that shows
their state.
In PostgreSQL, a table can inherit from zero or more other tables, and a query can reference either all rows of a table or all rows of a table plus all of its descendant tables. The latter behavior is the default. For example, the following query finds the names of all cities, including state capitals, that are located at an elevation over 500 feet:
SELECT name, elevation FROM cities WHERE elevation > 500;
Given the sample data from the PostgreSQL tutorial (see Section 2.1), this returns:
name | elevation -----------+----------- Las Vegas | 2174 Mariposa | 1953 Madison | 845
On the other hand, the following query finds all the cities that are not state capitals and are situated at an elevation over 500 feet:
SELECT name, elevation FROM ONLY cities WHERE elevation > 500; name | elevation -----------+----------- Las Vegas | 2174 Mariposa | 1953
Here the ONLY
keyword indicates that the query
should apply only to cities
, and not any tables
below cities
in the inheritance hierarchy. Many
of the commands that we have already discussed —
SELECT
, UPDATE
and
DELETE
— support the
ONLY
keyword.
You can also write the table name with a trailing *
to explicitly specify that descendant tables are included:
SELECT name, elevation FROM cities* WHERE elevation > 500;
Writing *
is not necessary, since this behavior is always
the default. However, this syntax is still supported for
compatibility with older releases where the default could be changed.
In some cases you might wish to know which table a particular row
originated from. There is a system column called
tableoid
in each table which can tell you the
originating table:
SELECT c.tableoid, c.name, c.elevation FROM cities c WHERE c.elevation > 500;
which returns:
tableoid | name | elevation ----------+-----------+----------- 139793 | Las Vegas | 2174 139793 | Mariposa | 1953 139798 | Madison | 845
(If you try to reproduce this example, you will probably get
different numeric OIDs.) By doing a join with
pg_class
you can see the actual table names:
SELECT p.relname, c.name, c.elevation FROM cities c, pg_class p WHERE c.elevation > 500 AND c.tableoid = p.oid;
which returns:
relname | name | elevation ----------+-----------+----------- cities | Las Vegas | 2174 cities | Mariposa | 1953 capitals | Madison | 845
Another way to get the same effect is to use the regclass
alias type, which will print the table OID symbolically:
SELECT c.tableoid::regclass, c.name, c.elevation FROM cities c WHERE c.elevation > 500;
Inheritance does not automatically propagate data from
INSERT
or COPY
commands to
other tables in the inheritance hierarchy. In our example, the
following INSERT
statement will fail:
INSERT INTO cities (name, population, elevation, state) VALUES ('Albany', NULL, NULL, 'NY');
We might hope that the data would somehow be routed to the
capitals
table, but this does not happen:
INSERT
always inserts into exactly the table
specified. In some cases it is possible to redirect the insertion
using a rule (see Chapter 41). However that does not
help for the above case because the cities
table
does not contain the column state
, and so the
command will be rejected before the rule can be applied.
All check constraints and not-null constraints on a parent table are
automatically inherited by its children, unless explicitly specified
otherwise with NO INHERIT
clauses. Other types of constraints
(unique, primary key, and foreign key constraints) are not inherited.
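For example, a check constraint that should apply to the cities table but not to capitals (the constraint name is made up) could be declared as:
ALTER TABLE cities ADD CONSTRAINT city_name_not_empty CHECK (name <> '') NO INHERIT;  -- hypothetical constraint, not inherited by capitals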
A table can inherit from more than one parent table, in which case it has the union of the columns defined by the parent tables. Any columns declared in the child table's definition are added to these. If the same column name appears in multiple parent tables, or in both a parent table and the child's definition, then these columns are “merged” so that there is only one such column in the child table. To be merged, columns must have the same data types, else an error is raised. Inheritable check constraints and not-null constraints are merged in a similar fashion. Thus, for example, a merged column will be marked not-null if any one of the column definitions it came from is marked not-null. Check constraints are merged if they have the same name, and the merge will fail if their conditions are different.
Table inheritance is typically established when the child table is
created, using the INHERITS
clause of the
CREATE TABLE
statement.
Alternatively, a table which is already defined in a compatible way can
have a new parent relationship added, using the INHERIT
variant of ALTER TABLE
.
To do this the new child table must already include columns with
the same names and types as the columns of the parent. It must also include
check constraints with the same names and check expressions as those of the
parent. Similarly an inheritance link can be removed from a child using the
NO INHERIT
variant of ALTER TABLE
.
Dynamically adding and removing inheritance links like this can be useful
when the inheritance relationship is being used for table
partitioning (see Section 5.11).
One convenient way to create a compatible table that will later be made
a new child is to use the LIKE
clause in CREATE
TABLE
. This creates a new table with the same columns as
the source table. If there are any CHECK
constraints defined on the source table, the INCLUDING
CONSTRAINTS
option to LIKE
should be
specified, as the new child must have constraints matching the parent
to be considered compatible.
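For example, a compatible table could be created and then attached to the cities hierarchy like this (more_cities is a hypothetical name):
CREATE TABLE more_cities (LIKE cities INCLUDING CONSTRAINTS);
ALTER TABLE more_cities INHERIT cities;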
A parent table cannot be dropped while any of its children remain. Neither
can columns or check constraints of child tables be dropped or altered
if they are inherited
from any parent tables. If you wish to remove a table and all of its
descendants, one easy way is to drop the parent table with the
CASCADE
option (see Section 5.14).
ALTER TABLE
will
propagate any changes in column data definitions and check
constraints down the inheritance hierarchy. Again, dropping
columns that are depended on by other tables is only possible when using
the CASCADE
option. ALTER
TABLE
follows the same rules for duplicate column merging
and rejection that apply during CREATE TABLE
.
Inherited queries perform access permission checks on the parent table
only. Thus, for example, granting UPDATE
permission on
the cities
table implies permission to update rows in
the capitals
table as well, when they are
accessed through cities
. This preserves the appearance
that the data is (also) in the parent table. But
the capitals
table could not be updated directly
without an additional grant. In a similar way, the parent table's row
security policies (see Section 5.8) are applied to
rows coming from child tables during an inherited query. A child table's
policies, if any, are applied only when it is the table explicitly named
in the query; and in that case, any policies attached to its parent(s) are
ignored.
Foreign tables (see Section 5.12) can also be part of inheritance hierarchies, either as parent or child tables, just as regular tables can be. If a foreign table is part of an inheritance hierarchy then any operations not supported by the foreign table are not supported on the whole hierarchy either.
Note that not all SQL commands are able to work on
inheritance hierarchies. Commands that are used for data querying,
data modification, or schema modification
(e.g., SELECT
, UPDATE
, DELETE
,
most variants of ALTER TABLE
, but
not INSERT
or ALTER TABLE ...
RENAME
) typically default to including child tables and
support the ONLY
notation to exclude them.
Commands that do database maintenance and tuning
(e.g., REINDEX
, VACUUM
)
typically only work on individual, physical tables and do not
support recursing over inheritance hierarchies. The respective
behavior of each individual command is documented in its reference
page (SQL Commands).
A serious limitation of the inheritance feature is that indexes (including unique constraints) and foreign key constraints only apply to single tables, not to their inheritance children. This is true on both the referencing and referenced sides of a foreign key constraint. Thus, in the terms of the above example:
If we declared cities
.name
to be
UNIQUE
or a PRIMARY KEY
, this would not stop the
capitals
table from having rows with names duplicating
rows in cities
. And those duplicate rows would by
default show up in queries from cities
. In fact, by
default capitals
would have no unique constraint at all,
and so could contain multiple rows with the same name.
You could add a unique constraint to capitals
, but this
would not prevent duplication compared to cities
.
Similarly, if we were to specify that
cities
.name
REFERENCES
some
other table, this constraint would not automatically propagate to
capitals
. In this case you could work around it by
manually adding the same REFERENCES
constraint to
capitals
.
Specifying that another table's column REFERENCES
cities(name)
would allow the other table to contain city names, but
not capital names. There is no good workaround for this case.
Some functionality not implemented for inheritance hierarchies is implemented for declarative partitioning. Considerable care is needed in deciding whether partitioning with legacy inheritance is useful for your application.
PostgreSQL supports basic table partitioning. This section describes why and how to implement partitioning as part of your database design.
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. Partitioning effectively substitutes for the upper tree levels of indexes, making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by using a sequential scan of that partition instead of using an index, which would require random-access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing
partitions, if the usage pattern is accounted for in the
partitioning design. Dropping an individual partition
using DROP TABLE
, or doing ALTER TABLE
DETACH PARTITION
, is far faster than a bulk
operation. These commands also entirely avoid the
VACUUM
overhead caused by a bulk DELETE
.
Seldom-used data can be migrated to cheaper and slower storage media.
These benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
PostgreSQL offers built-in support for the following forms of partitioning:
The table is partitioned into “ranges” defined
by a key column or set of columns, with no overlap between
the ranges of values assigned to different partitions. For
example, one might partition by date ranges, or by ranges of
identifiers for particular business objects.
Each range's bounds are understood as being inclusive at the
lower end and exclusive at the upper end. For example, if one
partition's range is from 1 to 10, and the next one's range is from 10 to 20, then the value 10 belongs to the second partition, not the first.
The table is partitioned by explicitly listing which key value(s) appear in each partition.
The table is partitioned by specifying a modulus and a remainder for each partition. Each partition will hold the rows for which the hash value of the partition key divided by the specified modulus will produce the specified remainder.
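As a hedged sketch of what list and hash partition bounds look like (all table and column names here are made up; range partitioning is shown in detail in the example below):
CREATE TABLE customers (customer_id int, region text) PARTITION BY LIST (region);  -- hypothetical list-partitioned table
CREATE TABLE customers_emea PARTITION OF customers FOR VALUES IN ('EU', 'MEA');
CREATE TABLE orders (order_id bigint, placed_at date) PARTITION BY HASH (order_id);  -- hypothetical hash-partitioned table
CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 0);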
If your application needs to use other forms of partitioning not listed
above, alternative methods such as inheritance and
UNION ALL
views can be used instead. Such methods
offer flexibility but do not have some of the performance benefits
of built-in declarative partitioning.
PostgreSQL allows you to declare that a table is divided into partitions. The table that is divided is referred to as a partitioned table. The declaration includes the partitioning method as described above, plus a list of columns or expressions to be used as the partition key.
The partitioned table itself is a “virtual” table having no storage of its own. Instead, the storage belongs to partitions, which are otherwise-ordinary tables associated with the partitioned table. Each partition stores a subset of the data as defined by its partition bounds. All rows inserted into a partitioned table will be routed to the appropriate one of the partitions based on the values of the partition key column(s). Updating the partition key of a row will cause it to be moved into a different partition if it no longer satisfies the partition bounds of its original partition.
Partitions may themselves be defined as partitioned tables, resulting in sub-partitioning. Although all partitions must have the same columns as their partitioned parent, partitions may have their own indexes, constraints and default values, distinct from those of other partitions. See CREATE TABLE for more details on creating partitioned tables and partitions.
It is not possible to turn a regular table into a partitioned table or
vice versa. However, it is possible to add an existing regular or
partitioned table as a partition of a partitioned table, or remove a
partition from a partitioned table turning it into a standalone table;
this can simplify and speed up many maintenance processes.
See ALTER TABLE to learn more about the
ATTACH PARTITION
and DETACH PARTITION
sub-commands.
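For example, detaching and later re-attaching a partition might look like this (the table and partition names are hypothetical):
ALTER TABLE sales DETACH PARTITION sales_2023;  -- sales and sales_2023 are made-up names
ALTER TABLE sales ATTACH PARTITION sales_2023 FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');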
Partitions can also be foreign tables, although considerable care is needed because it is then the user's responsibility that the contents of the foreign table satisfy the partitioning rule. There are some other restrictions as well. See CREATE FOREIGN TABLE for more information.
Suppose we are constructing a database for a large ice cream company. The company measures peak temperatures every day as well as ice cream sales in each region. Conceptually, we want a table like:
CREATE TABLE measurement ( city_id int not null, logdate date not null, peaktemp int, unitsales int );
We know that most queries will access just the last week's, month's or quarter's data, since the main use of this table will be to prepare online reports for management. To reduce the amount of old data that needs to be stored, we decide to keep only the most recent 3 years worth of data. At the beginning of each month we will remove the oldest month's data. In this situation we can use partitioning to help us meet all of our different requirements for the measurements table.
To use declarative partitioning in this case, use the following steps:
Create the measurement
table as a partitioned
table by specifying the PARTITION BY
clause, which
includes the partitioning method (RANGE
in this
case) and the list of column(s) to use as the partition key.
CREATE TABLE measurement ( city_id int not null, logdate date not null, peaktemp int, unitsales int ) PARTITION BY RANGE (logdate);
Create partitions. Each partition's definition must specify bounds that correspond to the partitioning method and partition key of the parent. Note that specifying bounds such that the new partition's values would overlap with those in one or more existing partitions will cause an error.
Partitions thus created are in every way normal PostgreSQL tables (or, possibly, foreign tables). It is possible to specify a tablespace and storage parameters for each partition separately.
For our example, each partition should hold one month's worth of data, to match the requirement of deleting one month's data at a time. So the commands might look like:
CREATE TABLE measurement_y2006m02 PARTITION OF measurement
    FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');

CREATE TABLE measurement_y2006m03 PARTITION OF measurement
    FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');

...
CREATE TABLE measurement_y2007m11 PARTITION OF measurement
    FOR VALUES FROM ('2007-11-01') TO ('2007-12-01');

CREATE TABLE measurement_y2007m12 PARTITION OF measurement
    FOR VALUES FROM ('2007-12-01') TO ('2008-01-01')
    TABLESPACE fasttablespace;

CREATE TABLE measurement_y2008m01 PARTITION OF measurement
    FOR VALUES FROM ('2008-01-01') TO ('2008-02-01')
    WITH (parallel_workers = 4)
    TABLESPACE fasttablespace;
(Recall that adjacent partitions can share a bound value, since range upper bounds are treated as exclusive bounds.)
If you wish to implement sub-partitioning, again specify the
PARTITION BY
clause in the commands used to create
individual partitions, for example:
CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES FROM ('2006-02-01') TO ('2006-03-01') PARTITION BY RANGE (peaktemp);
After creating partitions of measurement_y2006m02, any data inserted into measurement that is mapped to measurement_y2006m02 (or data that is directly inserted into measurement_y2006m02, which is allowed provided its partition constraint is satisfied) will be further redirected to one of its partitions based on the peaktemp column. The partition key specified may overlap with the parent's partition key, although care should be taken when specifying the bounds of a sub-partition such that the set of data it accepts constitutes a subset of what the partition's own bounds allow; the system does not try to check whether that's really the case.
Inserting data into the parent table that does not map to one of the existing partitions will cause an error; an appropriate partition must be added manually.
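One way to accept such rows instead is to declare a DEFAULT partition, which catches any row not matching the bounds of the other partitions. A minimal sketch (the partition name is hypothetical):

CREATE TABLE measurement_default PARTITION OF measurement DEFAULT;

Note that once a DEFAULT partition exists, it may have to be scanned when new partitions are attached, as described further below.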
It is not necessary to manually create table constraints describing the partition boundary conditions for partitions. Such constraints will be created automatically.
Create an index on the key column(s), as well as any other indexes you might want, on the partitioned table. (The key index is not strictly necessary, but in most scenarios it is helpful.) This automatically creates a matching index on each partition, and any partitions you create or attach later will also have such an index. An index or unique constraint declared on a partitioned table is “virtual” in the same way that the partitioned table is: the actual data is in child indexes on the individual partition tables.
CREATE INDEX ON measurement (logdate);
Ensure that the enable_partition_pruning configuration parameter is not disabled in postgresql.conf. If it is, queries will not be optimized as desired.
In the above example we would be creating a new partition each month, so it might be wise to write a script that generates the required DDL automatically.
Normally the set of partitions established when initially defining the table is not intended to remain static. It is common to want to remove partitions holding old data and periodically add new partitions for new data. One of the most important advantages of partitioning is precisely that it allows this otherwise painful task to be executed nearly instantaneously by manipulating the partition structure, rather than physically moving large amounts of data around.
The simplest option for removing old data is to drop the partition that is no longer necessary:
DROP TABLE measurement_y2006m02;
This can very quickly delete millions of records because it doesn't have
to individually delete every record. Note however that the above command
requires taking an ACCESS EXCLUSIVE
lock on the parent
table.
Another option that is often preferable is to remove the partition from the partitioned table but retain access to it as a table in its own right. This has two forms:
ALTER TABLE measurement DETACH PARTITION measurement_y2006m02;

ALTER TABLE measurement DETACH PARTITION measurement_y2006m02 CONCURRENTLY;
These allow further operations to be performed on the data before it is dropped. For example, this is often a useful time to back up the data using COPY, pg_dump, or similar tools. It might also be a useful time to aggregate data into smaller formats, perform other data manipulations, or run reports. The first form of the command requires an ACCESS EXCLUSIVE lock on the parent table. Adding the CONCURRENTLY qualifier as in the second form allows the detach operation to require only SHARE UPDATE EXCLUSIVE lock on the parent table, but see ALTER TABLE ... DETACH PARTITION for details on the restrictions.
Similarly we can add a new partition to handle new data. We can create an empty partition in the partitioned table just as the original partitions were created above:
CREATE TABLE measurement_y2008m02 PARTITION OF measurement FOR VALUES FROM ('2008-02-01') TO ('2008-03-01') TABLESPACE fasttablespace;
As an alternative, it is sometimes more convenient to create the new table outside the partition structure, and attach it as a partition later. This allows new data to be loaded, checked, and transformed prior to it appearing in the partitioned table. Moreover, the ATTACH PARTITION operation requires only SHARE UPDATE EXCLUSIVE lock on the partitioned table, as opposed to the ACCESS EXCLUSIVE lock that is required by CREATE TABLE ... PARTITION OF, so it is more friendly to concurrent operations on the partitioned table.
The CREATE TABLE ... LIKE
option is helpful
to avoid tediously repeating the parent table's definition:
CREATE TABLE measurement_y2008m02
    (LIKE measurement INCLUDING DEFAULTS INCLUDING CONSTRAINTS)
    TABLESPACE fasttablespace;

ALTER TABLE measurement_y2008m02 ADD CONSTRAINT y2008m02
    CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' );

\copy measurement_y2008m02 from 'measurement_y2008m02'
-- possibly some other data preparation work

ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
    FOR VALUES FROM ('2008-02-01') TO ('2008-03-01' );
Before running the ATTACH PARTITION
command, it is
recommended to create a CHECK
constraint on the table to
be attached that matches the expected partition constraint, as
illustrated above. That way, the system will be able to skip the scan
which is otherwise needed to validate the implicit
partition constraint. Without the CHECK
constraint,
the table will be scanned to validate the partition constraint while
holding an ACCESS EXCLUSIVE
lock on that partition.
It is recommended to drop the now-redundant CHECK
constraint after the ATTACH PARTITION
is complete. If
the table being attached is itself a partitioned table, then each of its
sub-partitions will be recursively locked and scanned until either a
suitable CHECK
constraint is encountered or the leaf
partitions are reached.
Similarly, if the partitioned table has a DEFAULT
partition, it is recommended to create a CHECK
constraint which excludes the to-be-attached partition's constraint. If
this is not done then the DEFAULT
partition will be
scanned to verify that it contains no records which should be located in
the partition being attached. This operation will be performed whilst
holding an ACCESS EXCLUSIVE
lock on the
DEFAULT
partition. If the DEFAULT
partition
is itself a partitioned table, then each of its partitions will be
recursively checked in the same way as the table being attached, as
mentioned above.
As explained above, it is possible to create indexes on partitioned tables so that they are applied automatically to the entire hierarchy. This is very convenient, as not only will the existing partitions become indexed, but also any partitions that are created in the future will. One limitation is that it's not possible to use the CONCURRENTLY qualifier when creating such a partitioned index. To avoid long lock times, it is possible to use CREATE INDEX ON ONLY the partitioned table; such an index is marked invalid, and the partitions do not get the index applied automatically. The indexes on partitions can be created individually using CONCURRENTLY, and then attached to the index on the parent using ALTER INDEX .. ATTACH PARTITION. Once indexes for all partitions are attached to the parent index, the parent index is marked valid automatically. Example:
CREATE INDEX measurement_usls_idx ON ONLY measurement (unitsales);

CREATE INDEX CONCURRENTLY measurement_usls_200602_idx
    ON measurement_y2006m02 (unitsales);
ALTER INDEX measurement_usls_idx
    ATTACH PARTITION measurement_usls_200602_idx;
...
This technique can be used with UNIQUE and PRIMARY KEY constraints too; the indexes are created implicitly when the constraint is created. Example:
ALTER TABLE ONLY measurement ADD UNIQUE (city_id, logdate);

ALTER TABLE measurement_y2006m02 ADD UNIQUE (city_id, logdate);
ALTER INDEX measurement_city_id_logdate_key
    ATTACH PARTITION measurement_y2006m02_city_id_logdate_key;
...
The following limitations apply to partitioned tables:
To create a unique or primary key constraint on a partitioned table, the partition keys must not include any expressions or function calls and the constraint's columns must include all of the partition key columns. This limitation exists because the individual indexes making up the constraint can only directly enforce uniqueness within their own partitions; therefore, the partition structure itself must guarantee that there are no duplicates in different partitions. (A sketch illustrating this limitation follows this list.)
There is no way to create an exclusion constraint spanning the whole partitioned table. It is only possible to put such a constraint on each leaf partition individually. Again, this limitation stems from not being able to enforce cross-partition restrictions.
BEFORE ROW
triggers on INSERT
cannot change which partition is the final destination for a new row.
Mixing temporary and permanent relations in the same partition tree is not allowed. Hence, if the partitioned table is permanent, so must be its partitions and likewise if the partitioned table is temporary. When using temporary relations, all members of the partition tree have to be from the same session.
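To illustrate the first limitation above: with the measurement table range-partitioned on logdate as in the earlier example, a unique constraint that includes the partition key column is accepted, while one that omits it is rejected. A minimal sketch:

-- Accepted: logdate, the partition key column, is included.
ALTER TABLE measurement ADD UNIQUE (city_id, logdate);

-- Rejected: without logdate, the per-partition indexes could not
-- guarantee that the same city_id does not appear in two partitions.
-- ALTER TABLE measurement ADD UNIQUE (city_id);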
Individual partitions are linked to their partitioned table using inheritance behind-the-scenes. However, it is not possible to use all of the generic features of inheritance with declaratively partitioned tables or their partitions, as discussed below. Notably, a partition cannot have any parents other than the partitioned table it is a partition of, nor can a table inherit from both a partitioned table and a regular table. That means partitioned tables and their partitions never share an inheritance hierarchy with regular tables.
Since a partition hierarchy consisting of the partitioned table and its partitions is still an inheritance hierarchy, tableoid and all the normal rules of inheritance apply as described in Section 5.10, with a few exceptions:
Partitions cannot have columns that are not present in the parent. It is not possible to specify columns when creating partitions with CREATE TABLE, nor is it possible to add columns to partitions after-the-fact using ALTER TABLE. Tables may be added as a partition with ALTER TABLE ... ATTACH PARTITION only if their columns exactly match the parent.
Both CHECK and NOT NULL constraints of a partitioned table are always inherited by all its partitions. CHECK constraints that are marked NO INHERIT are not allowed to be created on partitioned tables. You cannot drop a NOT NULL constraint on a partition's column if the same constraint is present in the parent table.
Using ONLY to add or drop a constraint on only the partitioned table is supported as long as there are no partitions. Once partitions exist, using ONLY will result in an error for any constraints other than UNIQUE and PRIMARY KEY. Instead, constraints on the partitions themselves can be added and (if they are not present in the parent table) dropped.
As a partitioned table does not have any data itself, attempts to use TRUNCATE ONLY on a partitioned table will always return an error.
While the built-in declarative partitioning is suitable for most common use cases, there are some circumstances where a more flexible approach may be useful. Partitioning can be implemented using table inheritance, which allows for several features not supported by declarative partitioning, such as:
For declarative partitioning, partitions must have exactly the same set of columns as the partitioned table, whereas with table inheritance, child tables may have extra columns not present in the parent.
Table inheritance allows for multiple inheritance.
Declarative partitioning only supports range, list and hash partitioning, whereas table inheritance allows data to be divided in a manner of the user's choosing. (Note, however, that if constraint exclusion is unable to prune child tables effectively, query performance might be poor.)
This example builds a partitioning structure equivalent to the declarative partitioning example above. Use the following steps:
Create the “root” table, from which all of the
“child” tables will inherit. This table will contain no data. Do not
define any check constraints on this table, unless you intend them
to be applied equally to all child tables. There is no point in
defining any indexes or unique constraints on it, either. For our
example, the root table is the measurement
table as originally defined:
CREATE TABLE measurement ( city_id int not null, logdate date not null, peaktemp int, unitsales int );
Create several “child” tables that each inherit from the root table. Normally, these tables will not add any columns to the set inherited from the root. Just as with declarative partitioning, these tables are in every way normal PostgreSQL tables (or foreign tables).
CREATE TABLE measurement_y2006m02 () INHERITS (measurement);
CREATE TABLE measurement_y2006m03 () INHERITS (measurement);
...
CREATE TABLE measurement_y2007m11 () INHERITS (measurement);
CREATE TABLE measurement_y2007m12 () INHERITS (measurement);
CREATE TABLE measurement_y2008m01 () INHERITS (measurement);
Add non-overlapping table constraints to the child tables to define the allowed key values in each.
Typical examples would be:
CHECK ( x = 1 )
CHECK ( county IN ( 'Oxfordshire', 'Buckinghamshire', 'Warwickshire' ))
CHECK ( outletID >= 100 AND outletID < 200 )
Ensure that the constraints guarantee that there is no overlap between the key values permitted in different child tables. A common mistake is to set up range constraints like:
CHECK ( outletID BETWEEN 100 AND 200 )
CHECK ( outletID BETWEEN 200 AND 300 )
This is wrong since it is not clear which child table the key value 200 belongs in. Instead, ranges should be defined in this style:
CREATE TABLE measurement_y2006m02 (
    CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
) INHERITS (measurement);

CREATE TABLE measurement_y2006m03 (
    CHECK ( logdate >= DATE '2006-03-01' AND logdate < DATE '2006-04-01' )
) INHERITS (measurement);

...
CREATE TABLE measurement_y2007m11 (
    CHECK ( logdate >= DATE '2007-11-01' AND logdate < DATE '2007-12-01' )
) INHERITS (measurement);

CREATE TABLE measurement_y2007m12 (
    CHECK ( logdate >= DATE '2007-12-01' AND logdate < DATE '2008-01-01' )
) INHERITS (measurement);

CREATE TABLE measurement_y2008m01 (
    CHECK ( logdate >= DATE '2008-01-01' AND logdate < DATE '2008-02-01' )
) INHERITS (measurement);
For each child table, create an index on the key column(s), as well as any other indexes you might want.
CREATE INDEX measurement_y2006m02_logdate ON measurement_y2006m02 (logdate);
CREATE INDEX measurement_y2006m03_logdate ON measurement_y2006m03 (logdate);
CREATE INDEX measurement_y2007m11_logdate ON measurement_y2007m11 (logdate);
CREATE INDEX measurement_y2007m12_logdate ON measurement_y2007m12 (logdate);
CREATE INDEX measurement_y2008m01_logdate ON measurement_y2008m01 (logdate);
We want our application to be able to say INSERT INTO
measurement ...
and have the data be redirected into the
appropriate child table. We can arrange that by attaching
a suitable trigger function to the root table.
If data will be added only to the latest child, we can
use a very simple trigger function:
CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO measurement_y2008m01 VALUES (NEW.*);
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;
After creating the function, we create a trigger which calls the trigger function:
CREATE TRIGGER insert_measurement_trigger BEFORE INSERT ON measurement FOR EACH ROW EXECUTE FUNCTION measurement_insert_trigger();
We must redefine the trigger function each month so that it always inserts into the current child table. The trigger definition does not need to be updated, however.
We might want to insert data and have the server automatically locate the child table into which the row should be added. We could do this with a more complex trigger function, for example:
CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.logdate >= DATE '2006-02-01' AND
         NEW.logdate < DATE '2006-03-01' ) THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSIF ( NEW.logdate >= DATE '2006-03-01' AND
            NEW.logdate < DATE '2006-04-01' ) THEN
        INSERT INTO measurement_y2006m03 VALUES (NEW.*);
    ...
    ELSIF ( NEW.logdate >= DATE '2008-01-01' AND
            NEW.logdate < DATE '2008-02-01' ) THEN
        INSERT INTO measurement_y2008m01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range. Fix the measurement_insert_trigger() function!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;
The trigger definition is the same as before.
Note that each IF
test must exactly match the
CHECK
constraint for its child table.
While this function is more complex than the single-month case, it doesn't need to be updated as often, since branches can be added in advance of being needed.
In practice, it might be best to check the newest child first, if most inserts go into that child. For simplicity, we have shown the trigger's tests in the same order as in other parts of this example.
A different approach to redirecting inserts into the appropriate child table is to set up rules, instead of a trigger, on the root table. For example:
CREATE RULE measurement_insert_y2006m02 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
DO INSTEAD
    INSERT INTO measurement_y2006m02 VALUES (NEW.*);
...
CREATE RULE measurement_insert_y2008m01 AS
ON INSERT TO measurement WHERE
    ( logdate >= DATE '2008-01-01' AND logdate < DATE '2008-02-01' )
DO INSTEAD
    INSERT INTO measurement_y2008m01 VALUES (NEW.*);
A rule has significantly more overhead than a trigger, but the overhead is paid once per query rather than once per row, so this method might be advantageous for bulk-insert situations. In most cases, however, the trigger method will offer better performance.
Be aware that COPY
ignores rules. If you want to
use COPY
to insert data, you'll need to copy into the
correct child table rather than directly into the root. COPY
does fire triggers, so you can use it normally if you use the trigger
approach.
Another disadvantage of the rule approach is that there is no simple way to force an error if the set of rules doesn't cover the insertion date; the data will silently go into the root table instead.
Ensure that the constraint_exclusion configuration parameter is not disabled in postgresql.conf; otherwise child tables may be accessed unnecessarily.
As we can see, a complex table hierarchy could require a substantial amount of DDL. In the above example we would be creating a new child table each month, so it might be wise to write a script that generates the required DDL automatically.
To remove old data quickly, simply drop the child table that is no longer necessary:
DROP TABLE measurement_y2006m02;
To remove the child table from the inheritance hierarchy table but retain access to it as a table in its own right:
ALTER TABLE measurement_y2006m02 NO INHERIT measurement;
To add a new child table to handle new data, create an empty child table just as the original children were created above:
CREATE TABLE measurement_y2008m02 ( CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' ) ) INHERITS (measurement);
Alternatively, one may want to create and populate the new child table before adding it to the table hierarchy. This could allow data to be loaded, checked, and transformed before being made visible to queries on the parent table.
CREATE TABLE measurement_y2008m02
    (LIKE measurement INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

ALTER TABLE measurement_y2008m02 ADD CONSTRAINT y2008m02
    CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' );

\copy measurement_y2008m02 from 'measurement_y2008m02'
-- possibly some other data preparation work

ALTER TABLE measurement_y2008m02 INHERIT measurement;
The following caveats apply to partitioning implemented using inheritance:
There is no automatic way to verify that all of the
CHECK
constraints are mutually
exclusive. It is safer to create code that generates
child tables and creates and/or modifies associated objects than
to write each by hand.
Indexes and foreign key constraints apply to single tables and not to their inheritance children, hence they have some caveats to be aware of.
The schemes shown here assume that the values of a row's key column(s)
never change, or at least do not change enough to require it to move to another partition.
An UPDATE
that attempts
to do that will fail because of the CHECK
constraints.
If you need to handle such cases, you can put suitable update triggers
on the child tables, but it makes management of the structure
much more complicated.
If you are using manual VACUUM
or
ANALYZE
commands, don't forget that
you need to run them on each child table individually. A command like:
ANALYZE measurement;
will only process the root table.
INSERT
statements with ON CONFLICT
clauses are unlikely to work as expected, as the ON CONFLICT
action is only taken in case of unique violations on the specified
target relation, not its child relations.
Triggers or rules will be needed to route rows to the desired child table, unless the application is explicitly aware of the partitioning scheme. Triggers may be complicated to write, and will be much slower than the tuple routing performed internally by declarative partitioning.
Partition pruning is a query optimization technique that improves performance for declaratively partitioned tables. As an example:
SET enable_partition_pruning = on;            -- the default
SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';
Without partition pruning, the above query would scan each of the
partitions of the measurement
table. With
partition pruning enabled, the planner will examine the definition
of each partition and prove that the partition need not
be scanned because it could not contain any rows meeting the query's
WHERE
clause. When the planner can prove this, it
excludes (prunes) the partition from the query
plan.
By using the EXPLAIN command and the enable_partition_pruning configuration parameter, it's possible to show the difference between a plan for which partitions have been pruned and one for which they have not. A typical unoptimized plan for this type of table setup is:
SET enable_partition_pruning = off;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Aggregate  (cost=188.76..188.77 rows=1 width=8)
   ->  Append  (cost=0.00..181.05 rows=3085 width=0)
         ->  Seq Scan on measurement_y2006m02  (cost=0.00..33.12 rows=617 width=0)
               Filter: (logdate >= '2008-01-01'::date)
         ->  Seq Scan on measurement_y2006m03  (cost=0.00..33.12 rows=617 width=0)
               Filter: (logdate >= '2008-01-01'::date)
...
         ->  Seq Scan on measurement_y2007m11  (cost=0.00..33.12 rows=617 width=0)
               Filter: (logdate >= '2008-01-01'::date)
         ->  Seq Scan on measurement_y2007m12  (cost=0.00..33.12 rows=617 width=0)
               Filter: (logdate >= '2008-01-01'::date)
         ->  Seq Scan on measurement_y2008m01  (cost=0.00..33.12 rows=617 width=0)
               Filter: (logdate >= '2008-01-01'::date)
Some or all of the partitions might use index scans instead of full-table sequential scans, but the point here is that there is no need to scan the older partitions at all to answer this query. When we enable partition pruning, we get a significantly cheaper plan that will deliver the same answer:
SET enable_partition_pruning = on;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Aggregate  (cost=37.75..37.76 rows=1 width=8)
   ->  Seq Scan on measurement_y2008m01  (cost=0.00..33.12 rows=617 width=0)
         Filter: (logdate >= '2008-01-01'::date)
Note that partition pruning is driven only by the constraints defined implicitly by the partition keys, not by the presence of indexes. Therefore it isn't necessary to define indexes on the key columns. Whether an index needs to be created for a given partition depends on whether you expect that queries that scan the partition will generally scan a large part of the partition or just a small part. An index will be helpful in the latter case but not the former.
Partition pruning can be performed not only during the planning of a given query, but also during its execution. This is useful as it can allow more partitions to be pruned when clauses contain expressions whose values are not known at query planning time, for example, parameters defined in a PREPARE statement, using a value obtained from a subquery, or using a parameterized value on the inner side of a nested loop join. Partition pruning during execution can be performed at any of the following times (a short example follows this list):
During initialization of the query plan. Partition pruning can be performed here for parameter values which are known during the initialization phase of execution. Partitions which are pruned during this stage will not show up in the query's EXPLAIN or EXPLAIN ANALYZE. It is possible to determine the number of partitions which were removed during this phase by observing the “Subplans Removed” property in the EXPLAIN output.
During actual execution of the query plan. Partition pruning may also be performed here to remove partitions using values which are only known during actual query execution. This includes values from subqueries and values from execution-time parameters such as those from parameterized nested loop joins. Since the value of these parameters may change many times during the execution of the query, partition pruning is performed whenever one of the execution parameters being used by partition pruning changes. Determining if partitions were pruned during this phase requires careful inspection of the loops property in the EXPLAIN ANALYZE output. Subplans corresponding to different partitions may have different values for it depending on how many times each of them was pruned during execution. Some may be shown as (never executed) if they were pruned every time.
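As a sketch of how such pruning can be observed with the measurement table (the prepared statement name is hypothetical):

PREPARE recent_measurements (date) AS
    SELECT count(*) FROM measurement WHERE logdate >= $1;

-- If a custom plan is chosen, pruning happens at plan time; if a generic
-- plan is chosen, partitions that cannot match the parameter are pruned
-- at executor startup and reported as "Subplans Removed".
EXPLAIN (ANALYZE) EXECUTE recent_measurements (DATE '2008-01-01');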
Partition pruning can be disabled using the enable_partition_pruning setting.
Constraint exclusion is a query optimization technique similar to partition pruning. While it is primarily used for partitioning implemented using the legacy inheritance method, it can be used for other purposes, including with declarative partitioning.
Constraint exclusion works in a very similar way to partition
pruning, except that it uses each table's CHECK
constraints — which gives it its name — whereas partition
pruning uses the table's partition bounds, which exist only in the
case of declarative partitioning. Another difference is that
constraint exclusion is only applied at plan time; there is no attempt
to remove partitions at execution time.
The fact that constraint exclusion uses CHECK
constraints, which makes it slow compared to partition pruning, can
sometimes be used as an advantage: because constraints can be defined
even on declaratively-partitioned tables, in addition to their internal
partition bounds, constraint exclusion may be able
to elide additional partitions from the query plan.
The default (and recommended) setting of constraint_exclusion is neither on nor off, but an intermediate setting called partition, which causes the technique to be applied only to queries that are likely to be working on inheritance partitioned tables. The on setting causes the planner to examine CHECK constraints in all queries, even simple ones that are unlikely to benefit.
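For example, with the inheritance-based measurement setup above, one could enable the technique for all queries and inspect the resulting plan; a minimal sketch:

SET constraint_exclusion = on;
-- Child tables whose CHECK constraints contradict the WHERE clause
-- are excluded from the plan.
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';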
The following caveats apply to constraint exclusion:
Constraint exclusion is only applied during query planning, unlike partition pruning, which can also be applied during query execution.
Constraint exclusion only works when the query's WHERE
clause contains constants (or externally supplied parameters).
For example, a comparison against a non-immutable function such as
CURRENT_TIMESTAMP
cannot be optimized, since the
planner cannot know which child table the function's value might fall
into at run time.
Keep the partitioning constraints simple, else the planner may not be able to prove that child tables might not need to be visited. Use simple equality conditions for list partitioning, or simple range tests for range partitioning, as illustrated in the preceding examples. A good rule of thumb is that partitioning constraints should contain only comparisons of the partitioning column(s) to constants using B-tree-indexable operators, because only B-tree-indexable column(s) are allowed in the partition key.
All constraints on all children of the parent table are examined during constraint exclusion, so large numbers of children are likely to increase query planning time considerably. So the legacy inheritance based partitioning will work well with up to perhaps a hundred child tables; don't try to use many thousands of children.
The choice of how to partition a table should be made carefully, as the performance of query planning and execution can be negatively affected by poor design.
One of the most critical design decisions will be the column or columns
by which you partition your data. Often the best choice will be to
partition by the column or set of columns which most commonly appear in
WHERE
clauses of queries being executed on the
partitioned table. WHERE
clauses that are compatible
with the partition bound constraints can be used to prune unneeded
partitions. However, you may be forced into making other decisions by
requirements for the PRIMARY KEY
or a
UNIQUE
constraint. Removal of unwanted data is also a
factor to consider when planning your partitioning strategy. An entire
partition can be detached fairly quickly, so it may be beneficial to
design the partition strategy in such a way that all data to be removed
at once is located in a single partition.
Choosing the target number of partitions that the table should be divided
into is also a critical decision to make. Not having enough partitions
may mean that indexes remain too large and that data locality remains poor
which could result in low cache hit ratios. However, dividing the table
into too many partitions can also cause issues. Too many partitions can
mean longer query planning times and higher memory consumption during both
query planning and execution, as further described below.
When choosing how to partition your table,
it's also important to consider what changes may occur in the future. For
example, if you choose to have one partition per customer and you
currently have a small number of large customers, consider the
implications if in several years you instead find yourself with a large
number of small customers. In this case, it may be better to choose to
partition by HASH
and choose a reasonable number of
partitions rather than trying to partition by LIST
and
hoping that the number of customers does not increase beyond what it is
practical to partition the data by.
Sub-partitioning can be useful to further divide partitions that are expected to become larger than other partitions. Another option is to use range partitioning with multiple columns in the partition key. Either of these can easily lead to excessive numbers of partitions, so restraint is advisable.
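A sketch of the multi-column option (the table and partition names are hypothetical; bounds on multiple columns are compared row-wise, i.e., lexicographically, which can be surprising):

CREATE TABLE city_measurement (
    city_id   int  not null,
    logdate   date not null,
    peaktemp  int
) PARTITION BY RANGE (city_id, logdate);

-- Accepts rows with city_id < 100, plus rows with city_id = 100
-- and logdate < '2009-01-01'.
CREATE TABLE city_measurement_p1 PARTITION OF city_measurement
    FOR VALUES FROM (1, '2008-01-01') TO (100, '2009-01-01');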
It is important to consider the overhead of partitioning during query planning and execution. The query planner is generally able to handle partition hierarchies with up to a few thousand partitions fairly well, provided that typical queries allow the query planner to prune all but a small number of partitions. Planning times become longer and memory consumption becomes higher when more partitions remain after the planner performs partition pruning. Another reason to be concerned about having a large number of partitions is that the server's memory consumption may grow significantly over time, especially if many sessions touch large numbers of partitions. That's because each partition requires its metadata to be loaded into the local memory of each session that touches it.
With data warehouse type workloads, it can make sense to use a larger number of partitions than with an OLTP type workload. Generally, in data warehouses, query planning time is less of a concern as the majority of processing time is spent during query execution. With either of these two types of workload, it is important to make the right decisions early, as re-partitioning large quantities of data can be painfully slow. Simulations of the intended workload are often beneficial for optimizing the partitioning strategy. Never just assume that more partitions are better than fewer partitions, nor vice-versa.
PostgreSQL implements portions of the SQL/MED specification, allowing you to access data that resides outside PostgreSQL using regular SQL queries. Such data is referred to as foreign data. (Note that this usage is not to be confused with foreign keys, which are a type of constraint within the database.)
Foreign data is accessed with help from a
foreign data wrapper. A foreign data wrapper is a
library that can communicate with an external data source, hiding the
details of connecting to the data source and obtaining data from it.
There are some foreign data wrappers available as contrib
modules; see Appendix F. Other kinds of foreign data
wrappers might be found as third party products. If none of the existing
foreign data wrappers suit your needs, you can write your own; see Chapter 57.
To access foreign data, you need to create a foreign server object, which defines how to connect to a particular external data source according to the set of options used by its supporting foreign data wrapper. Then you need to create one or more foreign tables, which define the structure of the remote data. A foreign table can be used in queries just like a normal table, but a foreign table has no storage in the PostgreSQL server. Whenever it is used, PostgreSQL asks the foreign data wrapper to fetch data from the external source, or transmit data to the external source in the case of update commands.
Accessing remote data may require authenticating to the external data source. This information can be provided by a user mapping, which can provide additional data such as user names and passwords based on the current PostgreSQL role.
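As a sketch of these steps using the postgres_fdw contrib module (the server name, remote host, credentials, and table definition below are hypothetical):

CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_sales
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'sales-db.example.com', dbname 'sales', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_sales
    OPTIONS (user 'report_reader', password 'secret');

CREATE FOREIGN TABLE remote_orders (
    order_id   int,
    order_date date,
    amount     numeric
) SERVER remote_sales
  OPTIONS (schema_name 'public', table_name 'orders');

-- The foreign table can now be queried like an ordinary table.
SELECT count(*) FROM remote_orders WHERE order_date >= DATE '2008-01-01';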
For additional information, see CREATE FOREIGN DATA WRAPPER, CREATE SERVER, CREATE USER MAPPING, CREATE FOREIGN TABLE, and IMPORT FOREIGN SCHEMA.
Tables are the central objects in a relational database structure, because they hold your data. But they are not the only objects that exist in a database. Many other kinds of objects can be created to make the use and management of the data more efficient or convenient. They are not discussed in this chapter, but we give you a list here so that you are aware of what is possible:
Views
Functions, procedures, and operators
Data types and domains
Triggers and rewrite rules
Detailed information on these topics appears in Part V.
When you create complex database structures involving many tables with foreign key constraints, views, triggers, functions, etc. you implicitly create a net of dependencies between the objects. For instance, a table with a foreign key constraint depends on the table it references.
To ensure the integrity of the entire database structure, PostgreSQL makes sure that you cannot drop objects that other objects still depend on. For example, attempting to drop the products table we considered in Section 5.4.5, with the orders table depending on it, would result in an error message like this:
DROP TABLE products;

ERROR:  cannot drop table products because other objects depend on it
DETAIL:  constraint orders_product_no_fkey on table orders depends on table products
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
The error message contains a useful hint: if you do not want to bother deleting all the dependent objects individually, you can run:
DROP TABLE products CASCADE;
and all the dependent objects will be removed, as will any objects
that depend on them, recursively. In this case, it doesn't remove
the orders table, it only removes the foreign key constraint.
It stops there because nothing depends on the foreign key constraint.
(If you want to check what DROP ... CASCADE
will do,
run DROP
without CASCADE
and read the
DETAIL
output.)
Almost all DROP
commands in PostgreSQL support
specifying CASCADE
. Of course, the nature of
the possible dependencies varies with the type of the object. You
can also write RESTRICT
instead of
CASCADE
to get the default behavior, which is to
prevent dropping objects that any other objects depend on.
According to the SQL standard, specifying either
RESTRICT
or CASCADE
is
required in a DROP
command. No database system actually
enforces that rule, but whether the default behavior
is RESTRICT
or CASCADE
varies
across systems.
If a DROP
command lists multiple
objects, CASCADE
is only required when there are
dependencies outside the specified group. For example, when saying
DROP TABLE tab1, tab2
the existence of a foreign
key referencing tab1
from tab2
would not mean
that CASCADE
is needed to succeed.
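A minimal sketch of that case, assuming tab2 has a foreign key referencing tab1:

-- Succeeds without CASCADE: the only dependency (the foreign key
-- constraint) is between objects that are both being dropped.
DROP TABLE tab1, tab2;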
For a user-defined function or procedure whose body is defined as a string literal, PostgreSQL tracks dependencies associated with the function's externally-visible properties, such as its argument and result types, but not dependencies that could only be known by examining the function body. As an example, consider this situation:
CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple');

CREATE TABLE my_colors (color rainbow, note text);

CREATE FUNCTION get_color_note (rainbow) RETURNS text AS
  'SELECT note FROM my_colors WHERE color = $1'
  LANGUAGE SQL;
(See Section 38.5 for an explanation of SQL-language
functions.) PostgreSQL will be aware that
the get_color_note
function depends on the rainbow
type: dropping the type would force dropping the function, because its
argument type would no longer be defined. But PostgreSQL
will not consider get_color_note
to depend on
the my_colors
table, and so will not drop the function if
the table is dropped. While there are disadvantages to this approach,
there are also benefits. The function is still valid in some sense if the
table is missing, though executing it would cause an error; creating a new
table of the same name would allow the function to work again.
On the other hand, for a SQL-language function or procedure whose body is written in SQL-standard style, the body is parsed at function definition time and all dependencies recognized by the parser are stored. Thus, if we write the function above as
CREATE FUNCTION get_color_note (rainbow) RETURNS text
BEGIN ATOMIC
  SELECT note FROM my_colors WHERE color = $1;
END;
then the function's dependency on the my_colors table will be known and enforced by DROP.
Table of Contents
The previous chapter discussed how to create tables and other structures to hold your data. Now it is time to fill the tables with data. This chapter covers how to insert, update, and delete table data. The chapter after this will finally explain how to extract your long-lost data from the database.
When a table is created, it contains no data. The first thing to do before a database can be of much use is to insert data. Data is inserted one row at a time. You can also insert more than one row in a single command, but it is not possible to insert something that is not a complete row. Even if you know only some column values, a complete row must be created.
To create a new row, use the INSERT command. The command requires the table name and column values. For example, consider the products table from Chapter 5:
CREATE TABLE products ( product_no integer, name text, price numeric );
An example command to insert a row would be:
INSERT INTO products VALUES (1, 'Cheese', 9.99);
The data values are listed in the order in which the columns appear in the table, separated by commas. Usually, the data values will be literals (constants), but scalar expressions are also allowed.
The above syntax has the drawback that you need to know the order of the columns in the table. To avoid this you can also list the columns explicitly. For example, both of the following commands have the same effect as the one above:
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO products (name, price, product_no) VALUES ('Cheese', 9.99, 1);
Many users consider it good practice to always list the column names.
If you don't have values for all the columns, you can omit some of them. In that case, the columns will be filled with their default values. For example:
INSERT INTO products (product_no, name) VALUES (1, 'Cheese');
INSERT INTO products VALUES (1, 'Cheese');
The second form is a PostgreSQL extension. It fills the columns from the left with as many values as are given, and the rest will be defaulted.
For clarity, you can also request default values explicitly, for individual columns or for the entire row:
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', DEFAULT);
INSERT INTO products DEFAULT VALUES;
You can insert multiple rows in a single command:
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99), (2, 'Bread', 1.99), (3, 'Milk', 2.99);
It is also possible to insert the result of a query (which might be no rows, one row, or many rows):
INSERT INTO products (product_no, name, price) SELECT product_no, name, price FROM new_products WHERE release_date = 'today';
This provides the full power of the SQL query mechanism (Chapter 7) for computing the rows to be inserted.
When inserting a lot of data at the same time, consider using the COPY command. It is not as flexible as the INSERT command, but is more efficient. Refer to Section 14.4 for more information on improving bulk loading performance.
The modification of data that is already in the database is referred to as updating. You can update individual rows, all the rows in a table, or a subset of all rows. Each column can be updated separately; the other columns are not affected.
To update existing rows, use the UPDATE command. This requires three pieces of information:
The name of the table and column to update
The new value of the column
Which row(s) to update
Recall from Chapter 5 that SQL does not, in general, provide a unique identifier for rows. Therefore it is not always possible to directly specify which row to update. Instead, you specify which conditions a row must meet in order to be updated. Only if you have a primary key in the table (independent of whether you declared it or not) can you reliably address individual rows by choosing a condition that matches the primary key. Graphical database access tools rely on this fact to allow you to update rows individually.
For example, this command updates all products that have a price of 5 to have a price of 10:
UPDATE products SET price = 10 WHERE price = 5;
This might cause zero, one, or many rows to be updated. It is not an error to attempt an update that does not match any rows.
Let's look at that command in detail. First is the key word
UPDATE
followed by the table name. As usual,
the table name can be schema-qualified, otherwise it is looked up
in the path. Next is the key word SET
followed
by the column name, an equal sign, and the new column value. The
new column value can be any scalar expression, not just a constant.
For example, if you want to raise the price of all products by 10%
you could use:
UPDATE products SET price = price * 1.10;
As you see, the expression for the new value can refer to the existing
value(s) in the row. We also left out the WHERE
clause.
If it is omitted, it means that all rows in the table are updated.
If it is present, only those rows that match the
WHERE
condition are updated. Note that the equals
sign in the SET
clause is an assignment while
the one in the WHERE
clause is a comparison, but
this does not create any ambiguity. Of course, the
WHERE
condition does
not have to be an equality test. Many other operators are
available (see Chapter 9). But the expression
needs to evaluate to a Boolean result.
You can update more than one column in an
UPDATE
command by listing more than one
assignment in the SET
clause. For example:
UPDATE mytable SET a = 5, b = 3, c = 1 WHERE a > 0;
So far we have explained how to add data to tables and how to change data. What remains is to discuss how to remove data that is no longer needed. Just as adding data is only possible in whole rows, you can only remove entire rows from a table. In the previous section we explained that SQL does not provide a way to directly address individual rows. Therefore, removing rows can only be done by specifying conditions that the rows to be removed have to match. If you have a primary key in the table then you can specify the exact row. But you can also remove groups of rows matching a condition, or you can remove all rows in the table at once.
You use the DELETE command to remove rows; the syntax is very similar to the UPDATE command. For instance, to remove all rows from the products table that have a price of 10, use:
DELETE FROM products WHERE price = 10;
If you simply write:
DELETE FROM products;
then all rows in the table will be deleted! Caveat programmer.
Sometimes it is useful to obtain data from modified rows while they are
being manipulated. The INSERT
, UPDATE
,
and DELETE
commands all have an
optional RETURNING
clause that supports this. Use
of RETURNING
avoids performing an extra database query to
collect the data, and is especially valuable when it would otherwise be
difficult to identify the modified rows reliably.
The allowed contents of a RETURNING
clause are the same as
a SELECT
command's output list
(see Section 7.3). It can contain column
names of the command's target table, or value expressions using those
columns. A common shorthand is RETURNING *
, which selects
all columns of the target table in order.
In an INSERT
, the data available to RETURNING
is
the row as it was inserted. This is not so useful in trivial inserts,
since it would just repeat the data provided by the client. But it can
be very handy when relying on computed default values. For example,
when using a serial
column to provide unique identifiers, RETURNING
can return
the ID assigned to a new row:
CREATE TABLE users (firstname text, lastname text, id serial primary key);

INSERT INTO users (firstname, lastname) VALUES ('Joe', 'Cool') RETURNING id;
The RETURNING clause is also very useful with INSERT ... SELECT.
In an UPDATE
, the data available to RETURNING
is
the new content of the modified row. For example:
UPDATE products SET price = price * 1.10 WHERE price <= 99.99 RETURNING name, price AS new_price;
In a DELETE
, the data available to RETURNING
is
the content of the deleted row. For example:
DELETE FROM products WHERE obsoletion_date = 'today' RETURNING *;
If there are triggers (Chapter 39) on the target table,
the data available to RETURNING
is the row as modified by
the triggers. Thus, inspecting columns computed by triggers is another
common use-case for RETURNING
.
Table of Contents
The previous chapters explained how to create tables, how to fill them with data, and how to manipulate that data. Now we finally discuss how to retrieve the data from the database.
The process of retrieving or the command to retrieve data from a
database is called a query. In SQL the
SELECT
command is
used to specify queries. The general syntax of the
SELECT
command is
[WITH with_queries] SELECT select_list
    FROM table_expression
    [sort_specification]
The following sections describe the details of the select list, the
table expression, and the sort specification. WITH
queries are treated last since they are an advanced feature.
A simple kind of query has the form:
SELECT * FROM table1;
Assuming that there is a table called table1, this command would retrieve all rows and all user-defined columns from table1. (The method of retrieval depends on the client application. For example, the psql program will display an ASCII-art table on the screen, while client libraries will offer functions to extract individual values from the query result.) The select list specification * means all columns that the table expression happens to provide. A select list can also select a subset of the available columns or make calculations using the columns. For example, if table1 has columns named a, b, and c (and perhaps others) you can make the following query:
SELECT a, b + c FROM table1;
(assuming that b and c are of a numerical data type).
See Section 7.3 for more details.
FROM table1
is a simple kind of
table expression: it reads just one table. In general, table
expressions can be complex constructs of base tables, joins, and
subqueries. But you can also omit the table expression entirely and
use the SELECT
command as a calculator:
SELECT 3 * 4;
This is more useful if the expressions in the select list return varying results. For example, you could call a function this way:
SELECT random();
A table expression computes a table. The
table expression contains a FROM
clause that is
optionally followed by WHERE
, GROUP BY
, and
HAVING
clauses. Trivial table expressions simply refer
to a table on disk, a so-called base table, but more complex
expressions can be used to modify or combine base tables in various
ways.
The optional WHERE
, GROUP BY
, and
HAVING
clauses in the table expression specify a
pipeline of successive transformations performed on the table
derived in the FROM
clause. All these transformations
produce a virtual table that provides the rows that are passed to
the select list to compute the output rows of the query.
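As a sketch of such a pipeline, reusing the measurement table from the previous chapter (grouping and HAVING are covered in detail later in this chapter):

SELECT city_id, max(peaktemp)
    FROM measurement                        -- derive the base table
    WHERE logdate >= DATE '2008-01-01'      -- filter the rows
    GROUP BY city_id                        -- group the remaining rows
    HAVING max(peaktemp) > 30;              -- filter the groups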
The FROM Clause
The FROM
clause derives a
table from one or more other tables given in a comma-separated
table reference list.
FROM table_reference [, table_reference [, ...]]
A table reference can be a table name (possibly schema-qualified),
or a derived table such as a subquery, a JOIN
construct, or
complex combinations of these. If more than one table reference is
listed in the FROM
clause, the tables are cross-joined
(that is, the Cartesian product of their rows is formed; see below).
The result of the FROM
list is an intermediate virtual
table that can then be subject to
transformations by the WHERE
, GROUP BY
,
and HAVING
clauses and is finally the result of the
overall table expression.
When a table reference names a table that is the parent of a
table inheritance hierarchy, the table reference produces rows of
not only that table but all of its descendant tables, unless the
key word ONLY
precedes the table name. However, the
reference produces only the columns that appear in the named table
— any columns added in subtables are ignored.
Instead of writing ONLY
before the table name, you can write
*
after the table name to explicitly specify that descendant
tables are included. There is no real reason to use this syntax any more,
because searching descendant tables is now always the default behavior.
However, it is supported for compatibility with older releases.
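For example, assuming measurement is the inheritance-based parent table from the previous chapter, a sketch:

SELECT * FROM ONLY measurement;   -- rows stored in measurement itself only
SELECT * FROM measurement*;       -- explicitly include descendant tables
SELECT * FROM measurement;        -- same as the previous query (the default)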
A joined table is a table derived from two other (real or derived) tables according to the rules of the particular join type. Inner, outer, and cross-joins are available. The general syntax of a joined table is
T1 join_type T2 [ join_condition ]
Joins of all types can be chained together, or nested: either or
both T1
and
T2
can be joined tables. Parentheses
can be used around JOIN
clauses to control the join
order. In the absence of parentheses, JOIN
clauses
nest left-to-right.
Join Types
T1 CROSS JOIN T2
For every possible combination of rows from
T1
and
T2
(i.e., a Cartesian product),
the joined table will contain a
row consisting of all columns in T1
followed by all columns in T2
. If
the tables have N and M rows respectively, the joined
table will have N * M rows.
FROM T1 CROSS JOIN T2 is equivalent to FROM T1 INNER JOIN T2 ON TRUE (see below). It is also equivalent to FROM T1, T2.
This latter equivalence does not hold exactly when more than two tables appear, because JOIN binds more tightly than comma. For example FROM T1 CROSS JOIN T2 INNER JOIN T3 ON condition is not the same as FROM T1, T2 INNER JOIN T3 ON condition because the condition can reference T1 in the first case but not the second.
T1 { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } JOIN T2 ON boolean_expression
T1 { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } JOIN T2 USING ( join column list )
T1 NATURAL { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } JOIN T2
The words INNER
and
OUTER
are optional in all forms.
INNER
is the default;
LEFT
, RIGHT
, and
FULL
imply an outer join.
The join condition is specified in the
ON
or USING
clause, or implicitly by
the word NATURAL
. The join condition determines
which rows from the two source tables are considered to
“match”, as explained in detail below.
The possible types of qualified join are:
INNER JOIN
For each row R1 of T1, the joined table has a row for each row in T2 that satisfies the join condition with R1.
LEFT OUTER JOIN
First, an inner join is performed. Then, for each row in T1 that does not satisfy the join condition with any row in T2, a joined row is added with null values in columns of T2. Thus, the joined table always has at least one row for each row in T1.
RIGHT OUTER JOIN
First, an inner join is performed. Then, for each row in T2 that does not satisfy the join condition with any row in T1, a joined row is added with null values in columns of T1. This is the converse of a left join: the result table will always have a row for each row in T2.
FULL OUTER JOIN
First, an inner join is performed. Then, for each row in T1 that does not satisfy the join condition with any row in T2, a joined row is added with null values in columns of T2. Also, for each row of T2 that does not satisfy the join condition with any row in T1, a joined row with null values in the columns of T1 is added.
The ON
clause is the most general kind of join
condition: it takes a Boolean value expression of the same
kind as is used in a WHERE
clause. A pair of rows
from T1
and T2
match if the
ON
expression evaluates to true.
The USING
clause is a shorthand that allows you to take
advantage of the specific situation where both sides of the join use
the same name for the joining column(s). It takes a
comma-separated list of the shared column names
and forms a join condition that includes an equality comparison
for each one. For example, joining T1 and T2 with USING (a, b) produces
the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING
suppresses
redundant columns: there is no need to print both of the matched
columns, since they must have equal values. While JOIN
ON
produces all columns from T1
followed by all
columns from T2
, JOIN USING
produces one
output column for each of the listed column pairs (in the listed
order), followed by any remaining columns from T1
,
followed by any remaining columns from T2
.
Finally, NATURAL
is a shorthand form of
USING
: it forms a USING
list
consisting of all column names that appear in both
input tables. As with USING
, these columns appear
only once in the output table. If there are no common
column names, NATURAL JOIN
behaves like
JOIN ... ON TRUE
, producing a cross-product join.
USING
is reasonably safe from column changes
in the joined relations since only the listed columns
are combined. NATURAL
is considerably more risky since
any schema changes to either relation that cause a new matching
column name to be present will cause the join to combine that new
column as well.
To put this together, assume we have tables t1
:
 num | name
-----+------
   1 | a
   2 | b
   3 | c
and t2
:
 num | value
-----+-------
   1 | xxx
   3 | yyy
   5 | zzz
then we get the following results for the various joins:
=> SELECT * FROM t1 CROSS JOIN t2;
 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
   1 | a    |   3 | yyy
   1 | a    |   5 | zzz
   2 | b    |   1 | xxx
   2 | b    |   3 | yyy
   2 | b    |   5 | zzz
   3 | c    |   1 | xxx
   3 | c    |   3 | yyy
   3 | c    |   5 | zzz
(9 rows)

=> SELECT * FROM t1 INNER JOIN t2 ON t1.num = t2.num;
 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
   3 | c    |   3 | yyy
(2 rows)

=> SELECT * FROM t1 INNER JOIN t2 USING (num);
 num | name | value
-----+------+-------
   1 | a    | xxx
   3 | c    | yyy
(2 rows)

=> SELECT * FROM t1 NATURAL INNER JOIN t2;
 num | name | value
-----+------+-------
   1 | a    | xxx
   3 | c    | yyy
(2 rows)

=> SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num;
 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
   2 | b    |     |
   3 | c    |   3 | yyy
(3 rows)

=> SELECT * FROM t1 LEFT JOIN t2 USING (num);
 num | name | value
-----+------+-------
   1 | a    | xxx
   2 | b    |
   3 | c    | yyy
(3 rows)

=> SELECT * FROM t1 RIGHT JOIN t2 ON t1.num = t2.num;
 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
   3 | c    |   3 | yyy
     |      |   5 | zzz
(3 rows)

=> SELECT * FROM t1 FULL JOIN t2 ON t1.num = t2.num;
 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
   2 | b    |     |
   3 | c    |   3 | yyy
     |      |   5 | zzz
(4 rows)
The join condition specified with ON
can also contain
conditions that do not relate directly to the join. This can
prove useful for some queries but needs to be thought out
carefully. For example:
=> SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx';
 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
   2 | b    |     |
   3 | c    |     |
(3 rows)
Notice that placing the restriction in the WHERE
clause
produces a different result:
=> SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx';
 num | name | num | value
-----+------+-----+-------
   1 | a    |   1 | xxx
(1 row)
This is because a restriction placed in the ON
clause is processed before the join, while
a restriction placed in the WHERE
clause is processed
after the join.
That does not matter with inner joins, but it matters a lot with outer
joins.
A temporary name can be given to tables and complex table references to be used for references to the derived table in the rest of the query. This is called a table alias.
To create a table alias, write
FROM table_reference AS alias
or
FROM table_reference alias
The AS
key word is optional noise.
alias
can be any identifier.
A typical application of table aliases is to assign short identifiers to long table names to keep the join clauses readable. For example:
SELECT * FROM some_very_long_table_name s JOIN another_fairly_long_name a ON s.id = a.num;
The alias becomes the new name of the table reference so far as the current query is concerned — it is not allowed to refer to the table by the original name elsewhere in the query. Thus, this is not valid:
SELECT * FROM my_table AS m WHERE my_table.a > 5; -- wrong
Table aliases are mainly for notational convenience, but it is necessary to use them when joining a table to itself, e.g.:
SELECT * FROM people AS mother JOIN people AS child ON mother.id = child.mother_id;
Additionally, an alias is required if the table reference is a subquery (see Section 7.2.1.3).
Parentheses are used to resolve ambiguities. In the following example,
the first statement assigns the alias b
to the second
instance of my_table
, but the second statement assigns the
alias to the result of the join:
SELECT * FROM my_table AS a CROSS JOIN my_table AS b ...
SELECT * FROM (my_table AS a CROSS JOIN my_table) AS b ...
Another form of table aliasing gives temporary names to the columns of the table, as well as the table itself:
FROM table_reference [AS] alias ( column1 [, column2 [, ...]] )
If fewer column aliases are specified than the actual table has columns, the remaining columns are not renamed. This syntax is especially useful for self-joins or subqueries.
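For example, given a hypothetical table persons with more than two columns, this renames the table and its first two columns only:
SELECT p.fname, p.lname FROM persons AS p(fname, lname);  -- any further columns keep their original names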
When an alias is applied to the output of a JOIN
clause, the alias hides the original
name(s) within the JOIN
. For example:
SELECT a.* FROM my_table AS a JOIN your_table AS b ON ...
is valid SQL, but:
SELECT a.* FROM (my_table AS a JOIN your_table AS b ON ...) AS c
is not valid; the table alias a
is not visible
outside the alias c
.
Subqueries specifying a derived table must be enclosed in parentheses and must be assigned a table alias name (as in Section 7.2.1.2). For example:
FROM (SELECT * FROM table1) AS alias_name
This example is equivalent to FROM table1 AS
alias_name
. More interesting cases, which cannot be
reduced to a plain join, arise when the subquery involves
grouping or aggregation.
A subquery can also be a VALUES
list:
FROM (VALUES ('anne', 'smith'), ('bob', 'jones'), ('joe', 'blow')) AS names(first, last)
Again, a table alias is required. Assigning alias names to the columns
of the VALUES
list is optional, but is good practice.
For more information see Section 7.7.
Table functions are functions that produce a set of rows, made up
of either base data types (scalar types) or composite data types
(table rows). They are used like a table, view, or subquery in
the FROM
clause of a query. Columns returned by table
functions can be included in SELECT
,
JOIN
, or WHERE
clauses in the same manner
as columns of a table, view, or subquery.
Table functions may also be combined using the ROWS FROM
syntax, with the results returned in parallel columns; the number of
result rows in this case is that of the largest function result, with
smaller results padded with null values to match.
function_call [WITH ORDINALITY] [[AS] table_alias [(column_alias [, ... ])]]
ROWS FROM( function_call [, ... ] ) [WITH ORDINALITY] [[AS] table_alias [(column_alias [, ... ])]]
If the WITH ORDINALITY
clause is specified, an
additional column of type bigint
will be added to the
function result columns. This column numbers the rows of the function
result set, starting from 1. (This is a generalization of the
SQL-standard syntax for UNNEST ... WITH ORDINALITY
.)
By default, the ordinal column is called ordinality
, but
a different column name can be assigned to it using
an AS
clause.
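For example (a minimal sketch using unnest, which is described just below):
SELECT * FROM unnest(ARRAY['a','b','c']) WITH ORDINALITY;
-- column names default to "unnest" and "ordinality"

SELECT * FROM unnest(ARRAY['a','b','c']) WITH ORDINALITY AS t(item, idx);
-- idx numbers the elements 1, 2, 3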
The special table function UNNEST
may be called with
any number of array parameters, and it returns a corresponding number of
columns, as if UNNEST
(Section 9.19) had been called on each parameter
separately and combined using the ROWS FROM
construct.
UNNEST( array_expression [, ... ] ) [WITH ORDINALITY] [[AS] table_alias [(column_alias [, ... ])]]
If no table_alias
is specified, the function
name is used as the table name; in the case of a ROWS FROM()
construct, the first function's name is used.
If column aliases are not supplied, then for a function returning a base data type, the column name is also the same as the function name. For a function returning a composite type, the result columns get the names of the individual attributes of the type.
Some examples:
CREATE TABLE foo (fooid int, foosubid int, fooname text);

CREATE FUNCTION getfoo(int) RETURNS SETOF foo AS $$
    SELECT * FROM foo WHERE fooid = $1;
$$ LANGUAGE SQL;

SELECT * FROM getfoo(1) AS t1;

SELECT * FROM foo
    WHERE foosubid IN (
        SELECT foosubid
        FROM getfoo(foo.fooid) z
        WHERE z.fooid = foo.fooid
    );

CREATE VIEW vw_getfoo AS SELECT * FROM getfoo(1);

SELECT * FROM vw_getfoo;
In some cases it is useful to define table functions that can
return different column sets depending on how they are invoked.
To support this, the table function can be declared as returning
the pseudo-type record
with no OUT
parameters. When such a function is used in
a query, the expected row structure must be specified in the
query itself, so that the system can know how to parse and plan
the query. This syntax looks like:
function_call [AS] alias (column_definition [, ... ])
function_call AS [alias] (column_definition [, ... ])
ROWS FROM( ... function_call AS (column_definition [, ... ]) [, ... ] )
When not using the ROWS FROM()
syntax,
the column_definition
list replaces the column
alias list that could otherwise be attached to the FROM
item; the names in the column definitions serve as column aliases.
When using the ROWS FROM()
syntax,
a column_definition
list can be attached to
each member function separately; or if there is only one member function
and no WITH ORDINALITY
clause,
a column_definition
list can be written in
place of a column alias list following ROWS FROM()
.
Consider this example:
SELECT *
    FROM dblink('dbname=mydb', 'SELECT proname, prosrc FROM pg_proc')
      AS t1(proname name, prosrc text)
    WHERE proname LIKE 'bytea%';
The dblink function
(part of the dblink module) executes
a remote query. It is declared to return
record
since it might be used for any kind of query.
The actual column set must be specified in the calling query so
that the parser knows, for example, what *
should
expand to.
This example uses ROWS FROM
:
SELECT *
FROM ROWS FROM
    (
        json_to_recordset('[{"a":40,"b":"foo"},{"a":"100","b":"bar"}]')
            AS (a INTEGER, b TEXT),
        generate_series(1, 3)
    )
    AS x (p, q, s)
ORDER BY p;

  p  |  q  | s
-----+-----+---
  40 | foo | 1
 100 | bar | 2
     |     | 3
It joins two functions into a single FROM
target. json_to_recordset()
is instructed
to return two columns, the first integer
and the second text
. The result of
generate_series()
is used directly.
The ORDER BY
clause sorts the column values
as integers.
LATERAL
Subqueries
Subqueries appearing in FROM
can be
preceded by the key word LATERAL
. This allows them to
reference columns provided by preceding FROM
items.
(Without LATERAL
, each subquery is
evaluated independently and so cannot cross-reference any other
FROM
item.)
Table functions appearing in FROM
can also be
preceded by the key word LATERAL
, but for functions the
key word is optional; the function's arguments can contain references
to columns provided by preceding FROM
items in any case.
A LATERAL
item can appear at the top level in the
FROM
list, or within a JOIN
tree. In the latter
case it can also refer to any items that are on the left-hand side of a
JOIN
that it is on the right-hand side of.
When a FROM
item contains LATERAL
cross-references, evaluation proceeds as follows: for each row of the
FROM
item providing the cross-referenced column(s), or
set of rows of multiple FROM
items providing the
columns, the LATERAL
item is evaluated using that
row or row set's values of the columns. The resulting row(s) are
joined as usual with the rows they were computed from. This is
repeated for each row or set of rows from the column source table(s).
A trivial example of LATERAL
is
SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
This is not especially useful since it has exactly the same result as the more conventional
SELECT * FROM foo, bar WHERE bar.id = foo.bar_id;
LATERAL
is primarily useful when the cross-referenced
column is necessary for computing the row(s) to be joined. A common
application is providing an argument value for a set-returning function.
For example, supposing that vertices(polygon)
returns the
set of vertices of a polygon, we could identify close-together vertices
of polygons stored in a table with:
SELECT p1.id, p2.id, v1, v2
FROM polygons p1, polygons p2,
     LATERAL vertices(p1.poly) v1,
     LATERAL vertices(p2.poly) v2
WHERE (v1 <-> v2) < 10 AND p1.id != p2.id;
This query could also be written
SELECT p1.id, p2.id, v1, v2
FROM polygons p1 CROSS JOIN LATERAL vertices(p1.poly) v1,
     polygons p2 CROSS JOIN LATERAL vertices(p2.poly) v2
WHERE (v1 <-> v2) < 10 AND p1.id != p2.id;
or in several other equivalent formulations. (As already mentioned,
the LATERAL
key word is unnecessary in this example, but
we use it for clarity.)
It is often particularly handy to LEFT JOIN
to a
LATERAL
subquery, so that source rows will appear in
the result even if the LATERAL
subquery produces no
rows for them. For example, if get_product_names()
returns
the names of products made by a manufacturer, but some manufacturers in
our table currently produce no products, we could find out which ones
those are like this:
SELECT m.name FROM manufacturers m LEFT JOIN LATERAL get_product_names(m.id) pname ON true WHERE pname IS NULL;
WHERE
Clause
The syntax of the WHERE
clause is
WHERE search_condition
where search_condition
is any value
expression (see Section 4.2) that
returns a value of type boolean
.
After the processing of the FROM
clause is done, each
row of the derived virtual table is checked against the search
condition. If the result of the condition is true, the row is
kept in the output table, otherwise (i.e., if the result is
false or null) it is discarded. The search condition typically
references at least one column of the table generated in the
FROM
clause; this is not required, but otherwise the
WHERE
clause will be fairly useless.
The join condition of an inner join can be written either in
the WHERE
clause or in the JOIN
clause.
For example, these table expressions are equivalent:
FROM a, b WHERE a.id = b.id AND b.val > 5
and:
FROM a INNER JOIN b ON (a.id = b.id) WHERE b.val > 5
or perhaps even:
FROM a NATURAL JOIN b WHERE b.val > 5
Which one of these you use is mainly a matter of style. The
JOIN
syntax in the FROM
clause is
probably not as portable to other SQL database management systems,
even though it is in the SQL standard. For
outer joins there is no choice: they must be done in
the FROM
clause. The ON
or USING
clause of an outer join is not equivalent to a
WHERE
condition, because it results in the addition
of rows (for unmatched input rows) as well as the removal of rows
in the final result.
Here are some examples of WHERE
clauses:
SELECT ... FROM fdt WHERE c1 > 5

SELECT ... FROM fdt WHERE c1 IN (1, 2, 3)

SELECT ... FROM fdt WHERE c1 IN (SELECT c1 FROM t2)

SELECT ... FROM fdt WHERE c1 IN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10)

SELECT ... FROM fdt WHERE c1 BETWEEN (SELECT c3 FROM t2 WHERE c2 = fdt.c1 + 10) AND 100

SELECT ... FROM fdt WHERE EXISTS (SELECT c1 FROM t2 WHERE c2 > fdt.c1)
fdt
is the table derived in the
FROM
clause. Rows that do not meet the search
condition of the WHERE
clause are eliminated from
fdt
. Notice the use of scalar subqueries as
value expressions. Just like any other query, the subqueries can
employ complex table expressions. Notice also how
fdt
is referenced in the subqueries.
Qualifying c1
as fdt.c1
is only necessary
if c1
is also the name of a column in the derived
input table of the subquery. But qualifying the column name adds
clarity even when it is not needed. This example shows how the column
naming scope of an outer query extends into its inner queries.
GROUP BY
and HAVING
Clauses
After passing the WHERE
filter, the derived input
table might be subject to grouping, using the GROUP BY
clause, and elimination of group rows using the HAVING
clause.
SELECT select_list
    FROM ...
    [WHERE ...]
    GROUP BY grouping_column_reference [, grouping_column_reference]...
The GROUP BY
clause is
used to group together those rows in a table that have the same
values in all the columns listed. The order in which the columns
are listed does not matter. The effect is to combine each set
of rows having common values into one group row that
represents all rows in the group. This is done to
eliminate redundancy in the output and/or compute aggregates that
apply to these groups. For instance:
=> SELECT * FROM test1;
 x | y
---+---
 a | 3
 c | 2
 b | 5
 a | 1
(4 rows)

=> SELECT x FROM test1 GROUP BY x;
 x
---
 a
 b
 c
(3 rows)
In the second query, we could not have written SELECT *
FROM test1 GROUP BY x
, because there is no single value
for the column y
that could be associated with each
group. The grouped-by columns can be referenced in the select list since
they have a single value in each group.
In general, if a table is grouped, columns that are not
listed in GROUP BY
cannot be referenced except in aggregate
expressions. An example with aggregate expressions is:
=> SELECT x, sum(y) FROM test1 GROUP BY x;
 x | sum
---+-----
 a |   4
 b |   5
 c |   2
(3 rows)
Here sum
is an aggregate function that
computes a single value over the entire group. More information
about the available aggregate functions can be found in Section 9.21.
Grouping without aggregate expressions effectively calculates the
set of distinct values in a column. This can also be achieved
using the DISTINCT
clause (see Section 7.3.3).
Here is another example: it calculates the total sales for each product (rather than the total sales of all products):
SELECT product_id, p.name, (sum(s.units) * p.price) AS sales
    FROM products p LEFT JOIN sales s USING (product_id)
    GROUP BY product_id, p.name, p.price;
In this example, the columns product_id
,
p.name
, and p.price
must be
in the GROUP BY
clause since they are referenced in
the query select list (but see below). The column
s.units
does not have to be in the GROUP
BY
list since it is only used in an aggregate expression
(sum(...)
), which represents the sales
of a product. For each product, the query returns a summary row about
all sales of the product.
If the products table is set up so that, say,
product_id
is the primary key, then it would be
enough to group by product_id
in the above example,
since name and price would be functionally
dependent on the product ID, and so there would be no
ambiguity about which name and price value to return for each product
ID group.
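For instance, if products is indeed declared with product_id as its primary key (an assumption here), the previous query could be shortened to:
SELECT product_id, p.name, (sum(s.units) * p.price) AS sales
    FROM products p LEFT JOIN sales s USING (product_id)
    GROUP BY product_id;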
In strict SQL, GROUP BY
can only group by columns of
the source table but PostgreSQL extends
this to also allow GROUP BY
to group by columns in the
select list. Grouping by value expressions instead of simple
column names is also allowed.
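As an illustration (the substr expression here is made up for this sketch):
-- grouping by an output-column alias, a PostgreSQL extension
SELECT substr(name, 1, 1) AS initial, count(*) FROM products GROUP BY initial;

-- grouping by a value expression written out in full
SELECT substr(name, 1, 1) AS initial, count(*) FROM products GROUP BY substr(name, 1, 1);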
If a table has been grouped using GROUP BY
,
but only certain groups are of interest, the
HAVING
clause can be used, much like a
WHERE
clause, to eliminate groups from the result.
The syntax is:
SELECT select_list FROM ... [WHERE ...] GROUP BY ... HAVING boolean_expression
Expressions in the HAVING
clause can refer both to
grouped expressions and to ungrouped expressions (which necessarily
involve an aggregate function).
Example:
=> SELECT x, sum(y) FROM test1 GROUP BY x HAVING sum(y) > 3;
 x | sum
---+-----
 a |   4
 b |   5
(2 rows)

=> SELECT x, sum(y) FROM test1 GROUP BY x HAVING x < 'c';
 x | sum
---+-----
 a |   4
 b |   5
(2 rows)
Again, a more realistic example:
SELECT product_id, p.name, (sum(s.units) * (p.price - p.cost)) AS profit
    FROM products p LEFT JOIN sales s USING (product_id)
    WHERE s.date > CURRENT_DATE - INTERVAL '4 weeks'
    GROUP BY product_id, p.name, p.price, p.cost
    HAVING sum(p.price * s.units) > 5000;
In the example above, the WHERE
clause is selecting
rows by a column that is not grouped (the expression is only true for
sales during the last four weeks), while the HAVING
clause restricts the output to groups with total gross sales over
5000. Note that the aggregate expressions do not necessarily need
to be the same in all parts of the query.
If a query contains aggregate function calls, but no GROUP BY
clause, grouping still occurs: the result is a single group row (or
perhaps no rows at all, if the single row is then eliminated by
HAVING
).
The same is true if it contains a HAVING
clause, even
without any aggregate function calls or GROUP BY
clause.
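For example, with the test1 table shown earlier:
SELECT sum(y) FROM test1;                      -- always returns exactly one row (here: 11)
SELECT sum(y) FROM test1 HAVING sum(y) > 100;  -- returns no rows: the single group is eliminated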
GROUPING SETS
, CUBE
, and ROLLUP
More complex grouping operations than those described above are possible
using the concept of grouping sets. The data selected by
the FROM
and WHERE
clauses is grouped separately
by each specified grouping set, aggregates computed for each group just as
for simple GROUP BY
clauses, and then the results returned.
For example:
=> SELECT * FROM items_sold;
 brand | size | sales
-------+------+-------
 Foo   | L    |    10
 Foo   | M    |    20
 Bar   | M    |    15
 Bar   | L    |     5
(4 rows)

=> SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());
 brand | size | sum
-------+------+-----
 Foo   |      |  30
 Bar   |      |  20
       | L    |  15
       | M    |  35
       |      |  50
(5 rows)
Each sublist of GROUPING SETS
may specify zero or more columns
or expressions and is interpreted the same way as though it were directly
in the GROUP BY
clause. An empty grouping set means that all
rows are aggregated down to a single group (which is output even if no
input rows were present), as described above for the case of aggregate
functions with no GROUP BY
clause.
References to the grouping columns or expressions are replaced by null values in result rows for grouping sets in which those columns do not appear. To distinguish which grouping a particular output row resulted from, see Table 9.61.
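For instance, the grouping function can be added to the query above; its result is a bit mask showing which of its arguments are not grouped in the current grouping set (a sketch; see the referenced table for details):
SELECT brand, size, grouping(brand, size), sum(sales)
    FROM items_sold
    GROUP BY GROUPING SETS ((brand), (size), ());
-- grouping(brand, size) is 1 for the (brand) rows, 2 for the (size) rows, and 3 for the grand total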
A shorthand notation is provided for specifying two common types of grouping set. A clause of the form
ROLLUP ( e1, e2, e3, ... )
represents the given list of expressions and all prefixes of the list including the empty list; thus it is equivalent to
GROUPING SETS (
    ( e1, e2, e3, ... ),
    ...
    ( e1, e2 ),
    ( e1 ),
    ( )
)
This is commonly used for analysis over hierarchical data; e.g., total salary by department, division, and company-wide total.
A clause of the form
CUBE ( e1, e2, ... )
represents the given list and all of its possible subsets (i.e., the power set). Thus
CUBE ( a, b, c )
is equivalent to
GROUPING SETS ( ( a, b, c ), ( a, b ), ( a, c ), ( a ), ( b, c ), ( b ), ( c ), ( ) )
The individual elements of a CUBE
or ROLLUP
clause may be either individual expressions, or sublists of elements in
parentheses. In the latter case, the sublists are treated as single
units for the purposes of generating the individual grouping sets.
For example:
CUBE ( (a, b), (c, d) )
is equivalent to
GROUPING SETS ( ( a, b, c, d ), ( a, b ), ( c, d ), ( ) )
and
ROLLUP ( a, (b, c), d )
is equivalent to
GROUPING SETS ( ( a, b, c, d ), ( a, b, c ), ( a ), ( ) )
The CUBE
and ROLLUP
constructs can be used either
directly in the GROUP BY
clause, or nested inside a
GROUPING SETS
clause. If one GROUPING SETS
clause
is nested inside another, the effect is the same as if all the elements of
the inner clause had been written directly in the outer clause.
If multiple grouping items are specified in a single GROUP BY
clause, then the final list of grouping sets is the cross product of the
individual items. For example:
GROUP BY a, CUBE (b, c), GROUPING SETS ((d), (e))
is equivalent to
GROUP BY GROUPING SETS ( (a, b, c, d), (a, b, c, e), (a, b, d), (a, b, e), (a, c, d), (a, c, e), (a, d), (a, e) )
When specifying multiple grouping items together, the final set of grouping sets might contain duplicates. For example:
GROUP BY ROLLUP (a, b), ROLLUP (a, c)
is equivalent to
GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a, b), (a, c), (a), (a), (a, c), (a), () )
If these duplicates are undesirable, they can be removed using the
DISTINCT
clause directly on the GROUP BY
.
Therefore:
GROUP BY DISTINCT ROLLUP (a, b), ROLLUP (a, c)
is equivalent to
GROUP BY GROUPING SETS ( (a, b, c), (a, b), (a, c), (a), () )
This is not the same as using SELECT DISTINCT
because the output
rows may still contain duplicates. If any of the ungrouped columns contains NULL,
it will be indistinguishable from the NULL used when that same column is grouped.
The construct (a, b)
is normally recognized in expressions as
a row constructor.
Within the GROUP BY
clause, this does not apply at the top
levels of expressions, and (a, b)
is parsed as a list of
expressions as described above. If for some reason you need
a row constructor in a grouping expression, use ROW(a, b)
.
If the query contains any window functions (see
Section 3.5,
Section 9.22 and
Section 4.2.8), these functions are evaluated
after any grouping, aggregation, and HAVING
filtering is
performed. That is, if the query uses any aggregates, GROUP
BY
, or HAVING
, then the rows seen by the window functions
are the group rows instead of the original table rows from
FROM
/WHERE
.
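For example, reusing the test1 table, the window function below ranks the three group rows produced by GROUP BY, not the four underlying rows (the ranking column is purely illustrative):
SELECT x, sum(y), rank() OVER (ORDER BY sum(y) DESC) FROM test1 GROUP BY x;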
When multiple window functions are used, all the window functions having
syntactically equivalent PARTITION BY
and ORDER BY
clauses in their window definitions are guaranteed to be evaluated in a
single pass over the data. Therefore they will see the same sort ordering,
even if the ORDER BY
does not uniquely determine an ordering.
However, no guarantees are made about the evaluation of functions having
different PARTITION BY
or ORDER BY
specifications.
(In such cases a sort step is typically required between the passes of
window function evaluations, and the sort is not guaranteed to preserve
ordering of rows that its ORDER BY
sees as equivalent.)
Currently, window functions always require presorted data, and so the
query output will be ordered according to one or another of the window
functions' PARTITION BY
/ORDER BY
clauses.
It is not recommended to rely on this, however. Use an explicit
top-level ORDER BY
clause if you want to be sure the
results are sorted in a particular way.
As shown in the previous section,
the table expression in the SELECT
command
constructs an intermediate virtual table by possibly combining
tables, views, eliminating rows, grouping, etc. This table is
finally passed on to processing by the select list. The select
list determines which columns of the
intermediate table are actually output.
The simplest kind of select list is *
which
emits all columns that the table expression produces. Otherwise,
a select list is a comma-separated list of value expressions (as
defined in Section 4.2). For instance, it
could be a list of column names:
SELECT a, b, c FROM ...
The column names a
, b
, and c
are either the actual names of the columns of tables referenced
in the FROM
clause, or the aliases given to them as
explained in Section 7.2.1.2. The name
space available in the select list is the same as in the
WHERE
clause, unless grouping is used, in which case
it is the same as in the HAVING
clause.
If more than one table has a column of the same name, the table name must also be given, as in:
SELECT tbl1.a, tbl2.a, tbl1.b FROM ...
When working with multiple tables, it can also be useful to ask for all the columns of a particular table:
SELECT tbl1.*, tbl2.a FROM ...
See Section 8.16.5 for more about
the table_name
.*
notation.
If an arbitrary value expression is used in the select list, it
conceptually adds a new virtual column to the returned table. The
value expression is evaluated once for each result row, with
the row's values substituted for any column references. But the
expressions in the select list do not have to reference any
columns in the table expression of the FROM
clause;
they can be constant arithmetic expressions, for instance.
The entries in the select list can be assigned names for subsequent
processing, such as for use in an ORDER BY
clause
or for display by the client application. For example:
SELECT a AS value, b + c AS sum FROM ...
If no output column name is specified using AS
,
the system assigns a default column name. For simple column references,
this is the name of the referenced column. For function
calls, this is the name of the function. For complex expressions,
the system will generate a generic name.
The AS
key word is usually optional, but in some
cases where the desired column name matches a
PostgreSQL key word, you must write
AS
or double-quote the column name in order to
avoid ambiguity.
(Appendix C shows which key words
require AS
to be used as a column label.)
For example, FROM
is one such key word, so this
does not work:
SELECT a from, b + c AS sum FROM ...
but either of these does:
SELECT a AS from, b + c AS sum FROM ...
SELECT a "from", b + c AS sum FROM ...
For greatest safety against possible
future key word additions, it is recommended that you always either
write AS
or double-quote the output column name.
The naming of output columns here is different from that done in
the FROM
clause (see Section 7.2.1.2). It is possible
to rename the same column twice, but the name assigned in
the select list is the one that will be passed on.
DISTINCT
After the select list has been processed, the result table can
optionally be subject to the elimination of duplicate rows. The
DISTINCT
key word is written directly after
SELECT
to specify this:
SELECT DISTINCT select_list
...
(Instead of DISTINCT
the key word ALL
can be used to specify the default behavior of retaining all rows.)
Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.
Alternatively, an arbitrary expression can determine what rows are to be considered distinct:
SELECT DISTINCT ON (expression [, expression ...]) select_list ...
Here expression
is an arbitrary value
expression that is evaluated for all rows. A set of rows for
which all the expressions are equal are considered duplicates, and
only the first row of the set is kept in the output. Note that
the “first row” of a set is unpredictable unless the
query is sorted on enough columns to guarantee a unique ordering
of the rows arriving at the DISTINCT
filter.
(DISTINCT ON
processing occurs after ORDER
BY
sorting.)
The DISTINCT ON
clause is not part of the SQL standard
and is sometimes considered bad style because of the potentially
indeterminate nature of its results. With judicious use of
GROUP BY
and subqueries in FROM
, this
construct can be avoided, but it is often the most convenient
alternative.
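For example, assuming a hypothetical weather_reports table, this keeps only the most recent report per location; the ORDER BY makes the retained “first row” well defined:
SELECT DISTINCT ON (location) location, time, report
    FROM weather_reports
    ORDER BY location, time DESC;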
UNION, INTERSECT, EXCEPT

The results of two queries can be combined using the set operations union, intersection, and difference. The syntax is
query1 UNION [ALL] query2
query1 INTERSECT [ALL] query2
query1 EXCEPT [ALL] query2
where query1
and
query2
are queries that can use any of
the features discussed up to this point.
UNION
effectively appends the result of
query2
to the result of
query1
(although there is no guarantee
that this is the order in which the rows are actually returned).
Furthermore, it eliminates duplicate rows from its result, in the same
way as DISTINCT
, unless UNION ALL
is used.
INTERSECT
returns all rows that are both in the result
of query1
and in the result of
query2
. Duplicate rows are eliminated
unless INTERSECT ALL
is used.
EXCEPT
returns all rows that are in the result of
query1
but not in the result of
query2
. (This is sometimes called the
difference between two queries.) Again, duplicates
are eliminated unless EXCEPT ALL
is used.
In order to calculate the union, intersection, or difference of two queries, the two queries must be “union compatible”, which means that they return the same number of columns and the corresponding columns have compatible data types, as described in Section 10.5.
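As a small illustration with hypothetical single-column queries:
SELECT city FROM customers UNION SELECT city FROM suppliers;      -- all cities, duplicates removed
SELECT city FROM customers INTERSECT SELECT city FROM suppliers;  -- cities appearing in both results
SELECT city FROM customers EXCEPT SELECT city FROM suppliers;     -- customer cities not in the supplier result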
Set operations can be combined, for example
query1 UNION query2 EXCEPT query3

which is equivalent to

(query1 UNION query2) EXCEPT query3
As shown here, you can use parentheses to control the order of
evaluation. Without parentheses, UNION
and EXCEPT
associate left-to-right,
but INTERSECT
binds more tightly than those two
operators. Thus
query1 UNION query2 INTERSECT query3

means

query1 UNION (query2 INTERSECT query3)
You can also surround an individual query
with parentheses. This is important if
the query
needs to use any of the clauses
discussed in following sections, such as LIMIT
.
Without parentheses, you'll get a syntax error, or else the clause will
be understood as applying to the output of the set operation rather
than one of its inputs. For example,
SELECT a FROM b UNION SELECT x FROM y LIMIT 10
is accepted, but it means
(SELECT a FROM b UNION SELECT x FROM y) LIMIT 10
not
SELECT a FROM b UNION (SELECT x FROM y LIMIT 10)
ORDER BY

After a query has produced an output table (after the select list has been processed) it can optionally be sorted. If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.
The ORDER BY
clause specifies the sort order:
SELECT select_list
    FROM table_expression
    ORDER BY sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }]
             [, sort_expression2 [ASC | DESC] [NULLS { FIRST | LAST }] ...]
The sort expression(s) can be any expression that would be valid in the query's select list. An example is:
SELECT a, b FROM table1 ORDER BY a + b, c;
When more than one expression is specified,
the later values are used to sort rows that are equal according to the
earlier values. Each expression can be followed by an optional
ASC
or DESC
keyword to set the sort direction to
ascending or descending. ASC
order is the default.
Ascending order puts smaller values first, where
“smaller” is defined in terms of the
<
operator. Similarly, descending order is
determined with the >
operator.
[6]
The NULLS FIRST
and NULLS LAST
options can be
used to determine whether nulls appear before or after non-null values
in the sort ordering. By default, null values sort as if larger than any
non-null value; that is, NULLS FIRST
is the default for
DESC
order, and NULLS LAST
otherwise.
Note that the ordering options are considered independently for each
sort column. For example ORDER BY x, y DESC
means
ORDER BY x ASC, y DESC
, which is not the same as
ORDER BY x DESC, y DESC
.
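For example, to sort in descending order but keep null values at the end rather than the front:
SELECT a, b FROM table1 ORDER BY a DESC NULLS LAST;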
A sort_expression
can also be the column label or number
of an output column, as in:
SELECT a + b AS sum, c FROM table1 ORDER BY sum;
SELECT a, max(b) FROM table1 GROUP BY a ORDER BY 1;
both of which sort by the first output column. Note that an output column name has to stand alone, that is, it cannot be used in an expression — for example, this is not correct:
SELECT a + b AS sum, c FROM table1 ORDER BY sum + c; -- wrong
This restriction is made to reduce ambiguity. There is still
ambiguity if an ORDER BY
item is a simple name that
could match either an output column name or a column from the table
expression. The output column is used in such cases. This would
only cause confusion if you use AS
to rename an output
column to match some other table column's name.
ORDER BY
can be applied to the result of a
UNION
, INTERSECT
, or EXCEPT
combination, but in this case it is only permitted to sort by
output column names or numbers, not by expressions.
LIMIT
and OFFSET
LIMIT
and OFFSET
allow you to retrieve just
a portion of the rows that are generated by the rest of the query:
SELECT select_list
    FROM table_expression
    [ ORDER BY ... ]
    [ LIMIT { number | ALL } ] [ OFFSET number ]
If a limit count is given, no more than that many rows will be
returned (but possibly fewer, if the query itself yields fewer rows).
LIMIT ALL
is the same as omitting the LIMIT
clause, as is LIMIT
with a NULL argument.
OFFSET
says to skip that many rows before beginning to
return rows. OFFSET 0
is the same as omitting the
OFFSET
clause, as is OFFSET
with a NULL argument.
If both OFFSET
and LIMIT
appear, then OFFSET
rows are
skipped before starting to count the LIMIT
rows that
are returned.
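For example, to retrieve rows 11 through 20 of a result in a well-defined order:
SELECT a, b FROM table1 ORDER BY a LIMIT 10 OFFSET 10;  -- skips the first 10 rows, then returns the next 10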
When using LIMIT
, it is important to use an
ORDER BY
clause that constrains the result rows into a
unique order. Otherwise you will get an unpredictable subset of
the query's rows. You might be asking for the tenth through
twentieth rows, but tenth through twentieth in what ordering? The
ordering is unknown, unless you specified ORDER BY
.
The query optimizer takes LIMIT
into account when
generating query plans, so you are very likely to get different
plans (yielding different row orders) depending on what you give
for LIMIT
and OFFSET
. Thus, using
different LIMIT
/OFFSET
values to select
different subsets of a query result will give
inconsistent results unless you enforce a predictable
result ordering with ORDER BY
. This is not a bug; it
is an inherent consequence of the fact that SQL does not promise to
deliver the results of a query in any particular order unless
ORDER BY
is used to constrain the order.
The rows skipped by an OFFSET
clause still have to be
computed inside the server; therefore a large OFFSET
might be inefficient.
VALUES
Lists
VALUES
provides a way to generate a “constant table”
that can be used in a query without having to actually create and populate
a table on-disk. The syntax is
VALUES ( expression [, ...] ) [, ...]
Each parenthesized list of expressions generates a row in the table.
The lists must all have the same number of elements (i.e., the number
of columns in the table), and corresponding entries in each list must
have compatible data types. The actual data type assigned to each column
of the result is determined using the same rules as for UNION
(see Section 10.5).
As an example:
VALUES (1, 'one'), (2, 'two'), (3, 'three');
will return a table of two columns and three rows. It's effectively equivalent to:
SELECT 1 AS column1, 'one' AS column2
UNION ALL
SELECT 2, 'two'
UNION ALL
SELECT 3, 'three';
By default, PostgreSQL assigns the names
column1
, column2
, etc. to the columns of a
VALUES
table. The column names are not specified by the
SQL standard and different database systems do it differently, so
it's usually better to override the default names with a table alias
list, like this:
=> SELECT * FROM (VALUES (1, 'one'), (2, 'two'), (3, 'three')) AS t (num,letter);
 num | letter
-----+--------
   1 | one
   2 | two
   3 | three
(3 rows)
Syntactically, VALUES
followed by expression lists is
treated as equivalent to:
SELECT select_list FROM table_expression
and can appear anywhere a SELECT
can. For example, you can
use it as part of a UNION
, or attach a
sort_specification
(ORDER BY
,
LIMIT
, and/or OFFSET
) to it. VALUES
is most commonly used as the data source in an INSERT
command,
and next most commonly as a subquery.
For more information see VALUES.
WITH
Queries (Common Table Expressions)
WITH
provides a way to write auxiliary statements for use in a
larger query. These statements, which are often referred to as Common
Table Expressions or CTEs, can be thought of as defining
temporary tables that exist just for one query. Each auxiliary statement
in a WITH
clause can be a SELECT
,
INSERT
, UPDATE
, or DELETE
; and the
WITH
clause itself is attached to a primary statement that can
also be a SELECT
, INSERT
, UPDATE
, or
DELETE
.
SELECT
in WITH
The basic value of SELECT
in WITH
is to
break down complicated queries into simpler parts. An example is:
WITH regional_sales AS (
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
), top_regions AS (
    SELECT region
    FROM regional_sales
    WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region,
       product,
       SUM(quantity) AS product_units,
       SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;
which displays per-product sales totals in only the top sales regions.
The WITH
clause defines two auxiliary statements named
regional_sales
and top_regions
,
where the output of regional_sales
is used in
top_regions
and the output of top_regions
is used in the primary SELECT
query.
This example could have been written without WITH
,
but we'd have needed two levels of nested sub-SELECT
s. It's a bit
easier to follow this way.
The optional RECURSIVE
modifier changes WITH
from a mere syntactic convenience into a feature that accomplishes
things not otherwise possible in standard SQL. Using
RECURSIVE
, a WITH
query can refer to its own
output. A very simple example is this query to sum the integers from 1
through 100:
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
The general form of a recursive WITH
query is always a
non-recursive term, then UNION
(or
UNION ALL
), then a
recursive term, where only the recursive term can contain
a reference to the query's own output. Such a query is executed as
follows:
Recursive Query Evaluation
Evaluate the non-recursive term. For UNION
(but not
UNION ALL
), discard duplicate rows. Include all remaining
rows in the result of the recursive query, and also place them in a
temporary working table.
So long as the working table is not empty, repeat these steps:
Evaluate the recursive term, substituting the current contents of
the working table for the recursive self-reference.
For UNION
(but not UNION ALL
), discard
duplicate rows and rows that duplicate any previous result row.
Include all remaining rows in the result of the recursive query, and
also place them in a temporary intermediate table.
Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
While RECURSIVE
allows queries to be specified
recursively, internally such queries are evaluated iteratively.
In the example above, the working table has just a single row in each step,
and it takes on the values from 1 through 100 in successive steps. In
the 100th step, there is no output because of the WHERE
clause, and so the query terminates.
Recursive queries are typically used to deal with hierarchical or tree-structured data. A useful example is this query to find all the direct and indirect sub-parts of a product, given only a table that shows immediate inclusions:
WITH RECURSIVE included_parts(sub_part, part, quantity) AS (
    SELECT sub_part, part, quantity FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part, p.quantity * pr.quantity
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
SELECT sub_part, SUM(quantity) as total_quantity
FROM included_parts
GROUP BY sub_part
When computing a tree traversal using a recursive query, you might want to order the results in either depth-first or breadth-first order. This can be done by computing an ordering column alongside the other data columns and using that to sort the results at the end. Note that this does not actually control in which order the query evaluation visits the rows; that is as always in SQL implementation-dependent. This approach merely provides a convenient way to order the results afterwards.
To create a depth-first order, we compute for each result row an array of
rows that we have visited so far. For example, consider the following
query that searches a table tree
using a
link
field:
WITH RECURSIVE search_tree(id, link, data) AS (
    SELECT t.id, t.link, t.data
    FROM tree t
  UNION ALL
    SELECT t.id, t.link, t.data
    FROM tree t, search_tree st
    WHERE t.id = st.link
)
SELECT * FROM search_tree;
To add depth-first ordering information, you can write this:
WITH RECURSIVE search_tree(id, link, data, path) AS (
    SELECT t.id, t.link, t.data, ARRAY[t.id]
    FROM tree t
  UNION ALL
    SELECT t.id, t.link, t.data, path || t.id
    FROM tree t, search_tree st
    WHERE t.id = st.link
)
SELECT * FROM search_tree ORDER BY path;
In the general case where more than one field needs to be used to identify
a row, use an array of rows. For example, if we needed to track fields
f1
and f2
:
WITH RECURSIVE search_tree(id, link, data, path) AS (
    SELECT t.id, t.link, t.data, ARRAY[ROW(t.f1, t.f2)]
    FROM tree t
  UNION ALL
    SELECT t.id, t.link, t.data, path || ROW(t.f1, t.f2)
    FROM tree t, search_tree st
    WHERE t.id = st.link
)
SELECT * FROM search_tree ORDER BY path;
Omit the ROW()
syntax in the common case where only one
field needs to be tracked. This allows a simple array rather than a
composite-type array to be used, gaining efficiency.
To create a breadth-first order, you can add a column that tracks the depth of the search, for example:
WITH RECURSIVE search_tree(id, link, data, depth) AS (
    SELECT t.id, t.link, t.data, 0
    FROM tree t
  UNION ALL
    SELECT t.id, t.link, t.data, depth + 1
    FROM tree t, search_tree st
    WHERE t.id = st.link
)
SELECT * FROM search_tree ORDER BY depth;
To get a stable sort, add data columns as secondary sorting columns.
The recursive query evaluation algorithm produces its output in breadth-first search order. However, this is an implementation detail and it is perhaps unsound to rely on it. The order of the rows within each level is certainly undefined, so some explicit ordering might be desired in any case.
There is built-in syntax to compute a depth- or breadth-first sort column. For example:
WITH RECURSIVE search_tree(id, link, data) AS (
    SELECT t.id, t.link, t.data
    FROM tree t
  UNION ALL
    SELECT t.id, t.link, t.data
    FROM tree t, search_tree st
    WHERE t.id = st.link
) SEARCH DEPTH FIRST BY id SET ordercol
SELECT * FROM search_tree ORDER BY ordercol;

WITH RECURSIVE search_tree(id, link, data) AS (
    SELECT t.id, t.link, t.data
    FROM tree t
  UNION ALL
    SELECT t.id, t.link, t.data
    FROM tree t, search_tree st
    WHERE t.id = st.link
) SEARCH BREADTH FIRST BY id SET ordercol
SELECT * FROM search_tree ORDER BY ordercol;
This syntax is internally expanded to something similar to the above
hand-written forms. The SEARCH
clause specifies whether
depth-first or breadth-first search is wanted, the list of columns to track for
sorting, and a column name that will contain the result data that can be
used for sorting. That column will implicitly be added to the output rows
of the CTE.
When working with recursive queries it is important to be sure that
the recursive part of the query will eventually return no tuples,
or else the query will loop indefinitely. Sometimes, using
UNION
instead of UNION ALL
can accomplish this
by discarding rows that duplicate previous output rows. However, often a
cycle does not involve output rows that are completely duplicate: it may be
necessary to check just one or a few fields to see if the same point has
been reached before. The standard method for handling such situations is
to compute an array of the already-visited values. For example, consider again
the following query that searches a table graph
using a
link
field:
WITH RECURSIVE search_graph(id, link, data, depth) AS (
    SELECT g.id, g.link, g.data, 0
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1
    FROM graph g, search_graph sg
    WHERE g.id = sg.link
)
SELECT * FROM search_graph;
This query will loop if the link
relationships contain
cycles. Because we require a “depth” output, just changing
UNION ALL
to UNION
would not eliminate the looping.
Instead we need to recognize whether we have reached the same row again
while following a particular path of links. We add two columns
is_cycle
and path
to the loop-prone query:
WITH RECURSIVE search_graph(id, link, data, depth, is_cycle, path) AS (
    SELECT g.id, g.link, g.data, 0,
        false,
        ARRAY[g.id]
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
        g.id = ANY(path),
        path || g.id
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT is_cycle
)
SELECT * FROM search_graph;
Aside from preventing cycles, the array value is often useful in its own right as representing the “path” taken to reach any particular row.
In the general case where more than one field needs to be checked to
recognize a cycle, use an array of rows. For example, if we needed to
compare fields f1
and f2
:
WITH RECURSIVE search_graph(id, link, data, depth, is_cycle, path) AS (
    SELECT g.id, g.link, g.data, 0,
        false,
        ARRAY[ROW(g.f1, g.f2)]
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
        ROW(g.f1, g.f2) = ANY(path),
        path || ROW(g.f1, g.f2)
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT is_cycle
)
SELECT * FROM search_graph;
Omit the ROW()
syntax in the common case where only one field
needs to be checked to recognize a cycle. This allows a simple array
rather than a composite-type array to be used, gaining efficiency.
There is built-in syntax to simplify cycle detection. The above query can also be written like this:
WITH RECURSIVE search_graph(id, link, data, depth) AS (
SELECT g.id, g.link, g.data, 1
FROM graph g
UNION ALL
SELECT g.id, g.link, g.data, sg.depth + 1
FROM graph g, search_graph sg
WHERE g.id = sg.link
) CYCLE id SET is_cycle USING path
SELECT * FROM search_graph;
and it will be internally rewritten to the above form. The
CYCLE
clause specifies first the list of columns to
track for cycle detection, then a column name that will show whether a
cycle has been detected, and finally the name of another column that will track the
path. The cycle and path columns will implicitly be added to the output
rows of the CTE.
The cycle path column is computed in the same way as the depth-first
ordering column shown in the previous section. A query can have both a
SEARCH
and a CYCLE
clause, but a
depth-first search specification and a cycle detection specification would
create redundant computations, so it's more efficient to just use the
CYCLE
clause and order by the path column. If
breadth-first ordering is wanted, then specifying both
SEARCH
and CYCLE
can be useful.
A helpful trick for testing queries
when you are not certain if they might loop is to place a LIMIT
in the parent query. For example, this query would loop forever without
the LIMIT
:
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t
)
SELECT n FROM t LIMIT 100;
This works because PostgreSQL's implementation
evaluates only as many rows of a WITH
query as are actually
fetched by the parent query. Using this trick in production is not
recommended, because other systems might work differently. Also, it
usually won't work if you make the outer query sort the recursive query's
results or join them to some other table, because in such cases the
outer query will usually try to fetch all of the WITH
query's
output anyway.
A useful property of WITH
queries is that they are
normally evaluated only once per execution of the parent query, even if
they are referred to more than once by the parent query or
sibling WITH
queries.
Thus, expensive calculations that are needed in multiple places can be
placed within a WITH
query to avoid redundant work. Another
possible application is to prevent unwanted multiple evaluations of
functions with side-effects.
However, the other side of this coin is that the optimizer is not able to
push restrictions from the parent query down into a multiply-referenced
WITH
query, since that might affect all uses of the
WITH
query's output when it should affect only one.
The multiply-referenced WITH
query will be
evaluated as written, without suppression of rows that the parent query
might discard afterwards. (But, as mentioned above, evaluation might stop
early if the reference(s) to the query demand only a limited number of
rows.)
However, if a WITH
query is non-recursive and
side-effect-free (that is, it is a SELECT
containing
no volatile functions) then it can be folded into the parent query,
allowing joint optimization of the two query levels. By default, this
happens if the parent query references the WITH
query
just once, but not if it references the WITH
query
more than once. You can override that decision by
specifying MATERIALIZED
to force separate calculation
of the WITH
query, or by specifying NOT
MATERIALIZED
to force it to be merged into the parent query.
The latter choice risks duplicate computation of
the WITH
query, but it can still give a net savings if
each usage of the WITH
query needs only a small part
of the WITH
query's full output.
A simple example of these rules is
WITH w AS ( SELECT * FROM big_table ) SELECT * FROM w WHERE key = 123;
This WITH
query will be folded, producing the same
execution plan as
SELECT * FROM big_table WHERE key = 123;
In particular, if there's an index on key
,
it will probably be used to fetch just the rows having key =
123
. On the other hand, in
WITH w AS ( SELECT * FROM big_table ) SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref WHERE w2.key = 123;
the WITH
query will be materialized, producing a
temporary copy of big_table
that is then
joined with itself — without benefit of any index. This query
will be executed much more efficiently if written as
WITH w AS NOT MATERIALIZED ( SELECT * FROM big_table ) SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref WHERE w2.key = 123;
so that the parent query's restrictions can be applied directly
to scans of big_table
.
An example where NOT MATERIALIZED
could be
undesirable is
WITH w AS ( SELECT key, very_expensive_function(val) as f FROM some_table ) SELECT * FROM w AS w1 JOIN w AS w2 ON w1.f = w2.f;
Here, materialization of the WITH
query ensures
that very_expensive_function
is evaluated only
once per table row, not twice.
The examples above only show WITH
being used with
SELECT
, but it can be attached in the same way to
INSERT
, UPDATE
, or DELETE
.
In each case it effectively provides temporary table(s) that can
be referred to in the main command.
Data-Modifying Statements in WITH
You can use data-modifying statements (INSERT
,
UPDATE
, or DELETE
) in WITH
. This
allows you to perform several different operations in the same query.
An example is:
WITH moved_rows AS (
    DELETE FROM products
    WHERE
        "date" >= '2010-10-01' AND
        "date" < '2010-11-01'
    RETURNING *
)
INSERT INTO products_log
SELECT * FROM moved_rows;
This query effectively moves rows from products
to
products_log
. The DELETE
in WITH
deletes the specified rows from products
, returning their
contents by means of its RETURNING
clause; and then the
primary query reads that output and inserts it into
products_log
.
A fine point of the above example is that the WITH
clause is
attached to the INSERT
, not the sub-SELECT
within
the INSERT
. This is necessary because data-modifying
statements are only allowed in WITH
clauses that are attached
to the top-level statement. However, normal WITH
visibility
rules apply, so it is possible to refer to the WITH
statement's output from the sub-SELECT
.
Data-modifying statements in WITH
usually have
RETURNING
clauses (see Section 6.4),
as shown in the example above.
It is the output of the RETURNING
clause, not the
target table of the data-modifying statement, that forms the temporary
table that can be referred to by the rest of the query. If a
data-modifying statement in WITH
lacks a RETURNING
clause, then it forms no temporary table and cannot be referred to in
the rest of the query. Such a statement will be executed nonetheless.
A not-particularly-useful example is:
WITH t AS ( DELETE FROM foo ) DELETE FROM bar;
This example would remove all rows from tables foo
and
bar
. The number of affected rows reported to the client
would only include rows removed from bar
.
Recursive self-references in data-modifying statements are not
allowed. In some cases it is possible to work around this limitation by
referring to the output of a recursive WITH
, for example:
WITH RECURSIVE included_parts(sub_part, part) AS (
    SELECT sub_part, part FROM parts WHERE part = 'our_product'
  UNION ALL
    SELECT p.sub_part, p.part
    FROM included_parts pr, parts p
    WHERE p.part = pr.sub_part
)
DELETE FROM parts
  WHERE part IN (SELECT part FROM included_parts);
This query would remove all direct and indirect subparts of a product.
Data-modifying statements in WITH
are executed exactly once,
and always to completion, independently of whether the primary query
reads all (or indeed any) of their output. Notice that this is different
from the rule for SELECT
in WITH
: as stated in the
previous section, execution of a SELECT
is carried only as far
as the primary query demands its output.
The sub-statements in WITH
are executed concurrently with
each other and with the main query. Therefore, when using data-modifying
statements in WITH
, the order in which the specified updates
actually happen is unpredictable. All the statements are executed with
the same snapshot (see Chapter 13), so they
cannot “see” one another's effects on the target tables. This
alleviates the effects of the unpredictability of the actual order of row
updates, and means that RETURNING
data is the only way to
communicate changes between different WITH
sub-statements and
the main query. An example of this is that in
WITH t AS (
    UPDATE products SET price = price * 1.05
    RETURNING *
)
SELECT * FROM products;
the outer SELECT
would return the original prices before the
action of the UPDATE
, while in
WITH t AS (
    UPDATE products SET price = price * 1.05
    RETURNING *
)
SELECT * FROM t;
the outer SELECT
would return the updated data.
Trying to update the same row twice in a single statement is not
supported. Only one of the modifications takes place, but it is not easy
(and sometimes not possible) to reliably predict which one. This also
applies to deleting a row that was already updated in the same statement:
only the update is performed. Therefore you should generally avoid trying
to modify a single row twice in a single statement. In particular avoid
writing WITH
sub-statements that could affect the same rows
changed by the main statement or a sibling sub-statement. The effects
of such a statement will not be predictable.
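For example, a statement of the following shape (using the same products table as the earlier examples) should be avoided, because the sub-statement and the main statement target overlapping rows:
WITH t AS (
    UPDATE products SET price = price * 1.05 WHERE price < 100 RETURNING *
)
DELETE FROM products WHERE price < 100;
-- Rows with price < 100 are targeted by both the UPDATE and the DELETE;
-- which of the two changes takes effect for those rows is not predictable.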
At present, any table used as the target of a data-modifying statement in
WITH
must not have a conditional rule, nor an ALSO
rule, nor an INSTEAD
rule that expands to multiple statements.
[6]
Actually, PostgreSQL uses the default B-tree
operator class for the expression's data type to determine the sort
ordering for ASC
and DESC
. Conventionally,
data types will be set up so that the <
and
>
operators correspond to this sort ordering,
but a user-defined data type's designer could choose to do something
different.
PostgreSQL has a rich set of native data types available to users. Users can add new types to PostgreSQL using the CREATE TYPE command.
Table 8.1 shows all the built-in general-purpose data types. Most of the alternative names listed in the “Aliases” column are the names used internally by PostgreSQL for historical reasons. In addition, some internally used or deprecated types are available, but are not listed here.
Table 8.1. Data Types
Name | Aliases | Description |
---|---|---|
bigint | int8 | signed eight-byte integer |
bigserial | serial8 | autoincrementing eight-byte integer |
bit [ (n) ] | | fixed-length bit string |
bit varying [ (n) ] | varbit [ (n) ] | variable-length bit string |
boolean | bool | logical Boolean (true/false) |
box | | rectangular box on a plane |
bytea | | binary data (“byte array”) |
character [ (n) ] | char [ (n) ] | fixed-length character string |
character varying [ (n) ] | varchar [ (n) ] | variable-length character string |
cidr | | IPv4 or IPv6 network address |
circle | | circle on a plane |
date | | calendar date (year, month, day) |
double precision | float8 | double precision floating-point number (8 bytes) |
inet | | IPv4 or IPv6 host address |
integer | int, int4 | signed four-byte integer |
interval [ fields ] [ (p) ] | | time span |
json | | textual JSON data |
jsonb | | binary JSON data, decomposed |
line | | infinite line on a plane |
lseg | | line segment on a plane |
macaddr | | MAC (Media Access Control) address |
macaddr8 | | MAC (Media Access Control) address (EUI-64 format) |
money | | currency amount |
numeric [ (p, s) ] | decimal [ (p, s) ] | exact numeric of selectable precision |
path | | geometric path on a plane |
pg_lsn | | PostgreSQL Log Sequence Number |
pg_snapshot | | user-level transaction ID snapshot |
point | | geometric point on a plane |
polygon | | closed geometric path on a plane |
real | float4 | single precision floating-point number (4 bytes) |
smallint | int2 | signed two-byte integer |
smallserial | serial2 | autoincrementing two-byte integer |
serial | serial4 | autoincrementing four-byte integer |
text | | variable-length character string |
time [ (p) ] [ without time zone ] | | time of day (no time zone) |
time [ (p) ] with time zone | timetz | time of day, including time zone |
timestamp [ (p) ] [ without time zone ] | | date and time (no time zone) |
timestamp [ (p) ] with time zone | timestamptz | date and time, including time zone |
tsquery | | text search query |
tsvector | | text search document |
txid_snapshot | | user-level transaction ID snapshot (deprecated; see pg_snapshot) |
uuid | | universally unique identifier |
xml | | XML data |
The following types (or spellings thereof) are specified by
SQL: bigint
, bit
, bit
varying
, boolean
, char
,
character varying
, character
,
varchar
, date
, double
precision
, integer
, interval
,
numeric
, decimal
, real
,
smallint
, time
(with or without time zone),
timestamp
(with or without time zone),
xml
.
Each data type has an external representation determined by its input and output functions. Many of the built-in types have obvious external formats. However, several types are either unique to PostgreSQL, such as geometric paths, or have several possible formats, such as the date and time types. Some of the input and output functions are not invertible, i.e., the result of an output function might lose accuracy when compared to the original input.
Numeric types consist of two-, four-, and eight-byte integers, four- and eight-byte floating-point numbers, and selectable-precision decimals. Table 8.2 lists the available types.
Table 8.2. Numeric Types
Name | Storage Size | Description | Range |
---|---|---|---|
smallint | 2 bytes | small-range integer | -32768 to +32767 |
integer | 4 bytes | typical choice for integer | -2147483648 to +2147483647 |
bigint | 8 bytes | large-range integer | -9223372036854775808 to +9223372036854775807 |
decimal | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
numeric | variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
real | 4 bytes | variable-precision, inexact | 6 decimal digits precision |
double precision | 8 bytes | variable-precision, inexact | 15 decimal digits precision |
smallserial | 2 bytes | small autoincrementing integer | 1 to 32767 |
serial | 4 bytes | autoincrementing integer | 1 to 2147483647 |
bigserial | 8 bytes | large autoincrementing integer | 1 to 9223372036854775807 |
The syntax of constants for the numeric types is described in Section 4.1.2. The numeric types have a full set of corresponding arithmetic operators and functions. Refer to Chapter 9 for more information. The following sections describe the types in detail.
The types smallint
, integer
, and
bigint
store whole numbers, that is, numbers without
fractional components, of various ranges. Attempts to store
values outside of the allowed range will result in an error.
The type integer
is the common choice, as it offers
the best balance between range, storage size, and performance.
The smallint
type is generally only used if disk
space is at a premium. The bigint
type is designed to be
used when the range of the integer
type is insufficient.
SQL only specifies the integer types
integer
(or int
),
smallint
, and bigint
. The
type names int2
, int4
, and
int8
are extensions, which are also used by some
other SQL database systems.
The type numeric
can store numbers with a
very large number of digits. It is especially recommended for
storing monetary amounts and other quantities where exactness is
required. Calculations with numeric
values yield exact
results where possible, e.g., addition, subtraction, multiplication.
However, calculations on numeric
values are very slow
compared to the integer types, or to the floating-point types
described in the next section.
We use the following terms below: The
precision of a numeric
is the total count of significant digits in the whole number,
that is, the number of digits to both sides of the decimal point.
The scale of a numeric
is the
count of decimal digits in the fractional part, to the right of the
decimal point. So the number 23.5141 has a precision of 6 and a
scale of 4. Integers can be considered to have a scale of zero.
Both the maximum precision and the maximum scale of a
numeric
column can be
configured. To declare a column of type numeric
use
the syntax:
NUMERIC(precision, scale)
The precision must be positive, the scale zero or positive. Alternatively:
NUMERIC(precision)
selects a scale of 0. Specifying:
NUMERIC
without any precision or scale creates an “unconstrained
numeric” column in which numeric values of any length can be
stored, up to the implementation limits. A column of this kind will
not coerce input values to any particular scale, whereas
numeric
columns with a declared scale will coerce
input values to that scale. (The SQL standard
requires a default scale of 0, i.e., coercion to integer
precision. We find this a bit useless. If you're concerned
about portability, always specify the precision and scale
explicitly.)
The maximum precision that can be explicitly specified in
a NUMERIC
type declaration is 1000. An
unconstrained NUMERIC
column is subject to the limits
described in Table 8.2.
If the scale of a value to be stored is greater than the declared scale of the column, the system will round the value to the specified number of fractional digits. Then, if the number of digits to the left of the decimal point exceeds the declared precision minus the declared scale, an error is raised.
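For example, with a hypothetical column declared as numeric(5, 2), input values are rounded to two fractional digits, and values needing more than three digits before the decimal point are rejected:
CREATE TABLE prices (amount numeric(5, 2));   -- hypothetical example table
INSERT INTO prices VALUES (123.456);          -- stored as 123.46 (rounded to scale 2)
INSERT INTO prices VALUES (1234.5);           -- ERROR:  numeric field overflow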
Numeric values are physically stored without any extra leading or
trailing zeroes. Thus, the declared precision and scale of a column
are maximums, not fixed allocations. (In this sense the numeric
type is more akin to varchar(n)
than to char(n).) The actual storage
requirement is two bytes for each group of four decimal digits,
plus three to eight bytes overhead.
In addition to ordinary numeric values, the numeric
type
has several special values:
Infinity
-Infinity
NaN
These are adapted from the IEEE 754 standard, and represent
“infinity”, “negative infinity”, and
“not-a-number”, respectively. When writing these values
as constants in an SQL command, you must put quotes around them,
for example UPDATE table SET x = '-Infinity'
.
On input, these strings are recognized in a case-insensitive manner.
The infinity values can alternatively be spelled inf
and -inf
.
The infinity values behave as per mathematical expectations. For
example, Infinity
plus any finite value equals
Infinity
, as does Infinity
plus Infinity
; but Infinity
minus Infinity
yields NaN
(not a
number), because it has no well-defined interpretation. Note that an
infinity can only be stored in an unconstrained numeric
column, because it notionally exceeds any finite precision limit.
The NaN
(not a number) value is used to represent
undefined calculational results. In general, any operation with
a NaN
input yields another NaN
.
The only exception is when the operation's other inputs are such that
the same output would be obtained if the NaN
were to
be replaced by any finite or infinite numeric value; then, that output
value is used for NaN
too. (An example of this
principle is that NaN
raised to the zero power
yields one.)
In most implementations of the “not-a-number” concept,
NaN
is not considered equal to any other numeric
value (including NaN
). In order to allow
numeric
values to be sorted and used in tree-based
indexes, PostgreSQL treats NaN
values as equal, and greater than all non-NaN
values.
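This behavior can be observed directly; for example, both of the following comparisons return true:
SELECT 'NaN'::numeric = 'NaN'::numeric;        -- true: NaN values are treated as equal
SELECT 'NaN'::numeric > 'Infinity'::numeric;   -- true: NaN sorts above all non-NaN values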
The types decimal
and numeric
are
equivalent. Both types are part of the SQL
standard.
When rounding values, the numeric
type rounds ties away
from zero, while (on most machines) the real
and double precision
types round ties to the nearest even
number. For example:
SELECT x,
       round(x::numeric) AS num_round,
       round(x::double precision) AS dbl_round
FROM generate_series(-3.5, 3.5, 1) as x;
  x   | num_round | dbl_round
------+-----------+-----------
 -3.5 |        -4 |        -4
 -2.5 |        -3 |        -2
 -1.5 |        -2 |        -2
 -0.5 |        -1 |        -0
  0.5 |         1 |         0
  1.5 |         2 |         2
  2.5 |         3 |         2
  3.5 |         4 |         4
(8 rows)
The data types real
and double precision
are
inexact, variable-precision numeric types. On all currently supported
platforms, these types are implementations of IEEE
Standard 754 for Binary Floating-Point Arithmetic (single and double
precision, respectively), to the extent that the underlying processor,
operating system, and compiler support it.
Inexact means that some values cannot be converted exactly to the internal format and are stored as approximations, so that storing and retrieving a value might show slight discrepancies. Managing these errors and how they propagate through calculations is the subject of an entire branch of mathematics and computer science and will not be discussed here, except for the following points:
If you require exact storage and calculations (such as for
monetary amounts), use the numeric
type instead.
If you want to do complicated calculations with these types for anything important, especially if you rely on certain behavior in boundary cases (infinity, underflow), you should evaluate the implementation carefully.
Comparing two floating-point values for equality might not always work as expected.
On all currently supported platforms, the real
type has a
range of around 1E-37 to 1E+37 with a precision of at least 6 decimal
digits. The double precision
type has a range of around
1E-307 to 1E+308 with a precision of at least 15 digits. Values that are
too large or too small will cause an error. Rounding might take place if
the precision of an input number is too high. Numbers too close to zero
that are not representable as distinct from zero will cause an underflow
error.
By default, floating point values are output in text form in their
shortest precise decimal representation; the decimal value produced is
closer to the true stored binary value than to any other value
representable in the same binary precision. (However, the output value is
currently never exactly midway between two
representable values, in order to avoid a widespread bug where input
routines do not properly respect the round-to-nearest-even rule.) This value will
use at most 17 significant decimal digits for float8
values, and at most 9 digits for float4
values.
This shortest-precise output format is much faster to generate than the historical rounded format.
For compatibility with output generated by older versions
of PostgreSQL, and to allow the output
precision to be reduced, the extra_float_digits
parameter can be used to select rounded decimal output instead. Setting a
value of 0 restores the previous default of rounding the value to 6
(for float4
) or 15 (for float8
)
significant decimal digits. Setting a negative value reduces the number
of digits further; for example -2 would round output to 4 or 13 digits
respectively.
Any value of extra_float_digits greater than 0 selects the shortest-precise format.
Applications that wanted precise values have historically had to set extra_float_digits to 3 to obtain them. For maximum compatibility between versions, they should continue to do so.
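The difference between the two output modes can be seen with a sum whose exact binary value does not round-trip through 15 digits (results shown in the comments are illustrative):
SET extra_float_digits = 0;          -- rounded output: 6 or 15 significant digits
SELECT 0.1::float8 + 0.2::float8;    -- 0.3
SET extra_float_digits = 1;          -- any value greater than 0: shortest-precise output
SELECT 0.1::float8 + 0.2::float8;    -- 0.30000000000000004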
In addition to ordinary numeric values, the floating-point types have several special values:
Infinity
-Infinity
NaN
These represent the IEEE 754 special values
“infinity”, “negative infinity”, and
“not-a-number”, respectively. When writing these values
as constants in an SQL command, you must put quotes around them,
for example UPDATE table SET x = '-Infinity'
. On input,
these strings are recognized in a case-insensitive manner.
The infinity values can alternatively be spelled inf
and -inf
.
IEEE 754 specifies that NaN
should not compare equal
to any other floating-point value (including NaN
).
In order to allow floating-point values to be sorted and used
in tree-based indexes, PostgreSQL treats
NaN
values as equal, and greater than all
non-NaN
values.
PostgreSQL also supports the SQL-standard
notations float
and
float(p) for specifying
inexact numeric types. Here, p
specifies
the minimum acceptable precision in binary digits.
PostgreSQL accepts
float(1)
to float(24)
as selecting the
real
type, while
float(25)
to float(53)
select
double precision
. Values of p
outside the allowed range draw an error.
float
with no precision specified is taken to mean
double precision
.
This section describes a PostgreSQL-specific way to create an autoincrementing column. Another way is to use the SQL-standard identity column feature, described at CREATE TABLE.
The data types smallserial
, serial
and
bigserial
are not true types, but merely
a notational convenience for creating unique identifier columns
(similar to the AUTO_INCREMENT
property
supported by some other databases). In the current
implementation, specifying:
CREATE TABLE tablename (
    colname SERIAL
);
is equivalent to specifying:
CREATE SEQUENCE tablename_colname_seq AS integer;
CREATE TABLE tablename (
    colname integer NOT NULL DEFAULT nextval('tablename_colname_seq')
);
ALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;
Thus, we have created an integer column and arranged for its default
values to be assigned from a sequence generator. A NOT NULL
constraint is applied to ensure that a null value cannot be
inserted. (In most cases you would also want to attach a
UNIQUE
or PRIMARY KEY
constraint to prevent
duplicate values from being inserted by accident, but this is
not automatic.) Lastly, the sequence is marked as “owned by”
the column, so that it will be dropped if the column or table is dropped.
Because smallserial
, serial
and
bigserial
are implemented using sequences, there may
be "holes" or gaps in the sequence of values which appears in the
column, even if no rows are ever deleted. A value allocated
from the sequence is still "used up" even if a row containing that
value is never successfully inserted into the table column. This
may happen, for example, if the inserting transaction rolls back.
See nextval()
in Section 9.17
for details.
To insert the next value of the sequence into the serial
column, specify that the serial
column should be assigned its default value. This can be done
either by excluding the column from the list of columns in
the INSERT
statement, or through the use of
the DEFAULT
key word.
The type names serial
and serial4
are
equivalent: both create integer
columns. The type
names bigserial
and serial8
work
the same way, except that they create a bigint
column. bigserial
should be used if you anticipate
the use of more than 2^31 identifiers over the
lifetime of the table. The type names smallserial
and
serial2
also work the same way, except that they
create a smallint
column.
The sequence created for a serial
column is
automatically dropped when the owning column is dropped.
You can drop the sequence without dropping the column, but this
will force removal of the column default expression.
The money
type stores a currency amount with a fixed
fractional precision; see Table 8.3. The fractional precision is
determined by the database's lc_monetary setting.
The range shown in the table assumes there are two fractional digits.
Input is accepted in a variety of formats, including integer and
floating-point literals, as well as typical
currency formatting, such as '$1,000.00'
.
Output is generally in the latter form but depends on the locale.
Table 8.3. Monetary Types
Name | Storage Size | Description | Range |
---|---|---|---|
money | 8 bytes | currency amount | -92233720368547758.08 to +92233720368547758.07 |
Since the output of this data type is locale-sensitive, it might not
work to load money
data into a database that has a different
setting of lc_monetary
. To avoid problems, before
restoring a dump into a new database make sure lc_monetary
has
the same or equivalent value as in the database that was dumped.
Values of the numeric
, int
, and
bigint
data types can be cast to money
.
Conversion from the real
and double precision
data types can be done by casting to numeric
first, for
example:
SELECT '12.34'::float8::numeric::money;
However, this is not recommended. Floating point numbers should not be used to handle money due to the potential for rounding errors.
A money
value can be cast to numeric
without
loss of precision. Conversion to other types could potentially lose
precision, and must also be done in two stages:
SELECT '52093.89'::money::numeric::float8;
Division of a money
value by an integer value is performed
with truncation of the fractional part towards zero. To get a rounded
result, divide by a floating-point value, or cast the money
value to numeric
before dividing and back to money
afterwards. (The latter is preferable to avoid risking precision loss.)
When a money
value is divided by another money
value, the result is double precision
(i.e., a pure number,
not money); the currency units cancel each other out in the division.
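For example, assuming an lc_monetary setting that formats amounts with a dollar sign and two fractional digits (the results shown in the comments are illustrative):
SELECT '100.00'::money / 6;                -- $16.66, fractional part truncated toward zero
SELECT '100.00'::money / 6.0::float8;      -- $16.67, rounded, because the divisor is floating point
SELECT '100.00'::money / '25.00'::money;   -- 4, a plain double precision number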
Table 8.4. Character Types
Name | Description |
---|---|
character varying(n), varchar(n) | variable-length with limit |
character(n), char(n) | fixed-length, blank padded |
text | variable unlimited length |
Table 8.4 shows the general-purpose character types available in PostgreSQL.
SQL defines two primary character types:
character varying(n) and
character(n), where n
is a positive integer. Both of these types can store strings up to
n characters (not bytes) in length. An attempt to store a
longer string into a column of these types will result in an
error, unless the excess characters are all spaces, in which case
the string will be truncated to the maximum length. (This somewhat
bizarre exception is required by the SQL
standard.) If the string to be stored is shorter than the declared
length, values of type character
will be space-padded;
values of type character varying
will simply store the
shorter
string.
If one explicitly casts a value to character
varying(n) or
character(n), then an over-length
value will be truncated to n
characters without
raising an error. (This too is required by the
SQL standard.)
The notations varchar(n) and char(n)
are aliases for character varying(n)
and character(n), respectively.
If specified, the length must be greater than zero and cannot exceed
10485760.
character
without length specifier is equivalent to
character(1)
. If character varying
is used
without length specifier, the type accepts strings of any size. The
latter is a PostgreSQL extension.
In addition, PostgreSQL provides the
text
type, which stores strings of any length.
Although the type text
is not in the
SQL standard, several other SQL database
management systems have it as well.
Values of type character
are physically padded
with spaces to the specified width n
, and are
stored and displayed that way. However, trailing spaces are treated as
semantically insignificant and disregarded when comparing two values
of type character
. In collations where whitespace
is significant, this behavior can produce unexpected results;
for example SELECT 'a '::CHAR(2) collate "C" <
E'a\n'::CHAR(2)
returns true, even though C
locale would consider a space to be greater than a newline.
Trailing spaces are removed when converting a character
value
to one of the other string types. Note that trailing spaces
are semantically significant in
character varying
and text
values, and
when using pattern matching, that is LIKE
and
regular expressions.
The characters that can be stored in any of these data types are determined by the database character set, which is selected when the database is created. Regardless of the specific character set, the character with code zero (sometimes called NUL) cannot be stored. For more information refer to Section 24.3.
The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case of
character
. Longer strings have 4 bytes of overhead instead
of 1. Long strings are compressed by the system automatically, so
the physical requirement on disk might be less. Very long values are also
stored in background tables so that they do not interfere with rapid
access to shorter column values. In any case, the longest
possible character string that can be stored is about 1 GB. (The
maximum value that will be allowed for n
in the data
type declaration is less than that. It wouldn't be useful to
change this because with multibyte character encodings the number of
characters and bytes can be quite different. If you desire to
store long strings with no specific upper limit, use
text
or character varying
without a length
specifier, rather than making up an arbitrary length limit.)
There is no performance difference among these three types,
apart from increased storage space when using the blank-padded
type, and a few extra CPU cycles to check the length when storing into
a length-constrained column. While
character(n) has performance
advantages in some other database systems, there is no such advantage in
PostgreSQL; in fact
character(n) is usually the slowest of
the three because of its additional storage costs. In most situations
text
or character varying
should be used
instead.
Refer to Section 4.1.2.1 for information about the syntax of string literals, and to Chapter 9 for information about available operators and functions.
Example 8.1. Using the Character Types
CREATE TABLE test1 (a character(4));
INSERT INTO test1 VALUES ('ok');
SELECT a, char_length(a) FROM test1; -- (1)

  a   | char_length
------+-------------
 ok   |           2

CREATE TABLE test2 (b varchar(5));
INSERT INTO test2 VALUES ('ok');
INSERT INTO test2 VALUES ('good ');
INSERT INTO test2 VALUES ('too long');
ERROR:  value too long for type character varying(5)

INSERT INTO test2 VALUES ('too long'::varchar(5)); -- explicit truncation
SELECT b, char_length(b) FROM test2;

   b   | char_length
-------+-------------
 ok    |           2
 good  |           5
 too l |           5
(1) The char_length function is discussed in Section 9.4.
There are two other fixed-length character types in
PostgreSQL, shown in Table 8.5. The name
type exists only for the storage of identifiers
in the internal system catalogs and is not intended for use by the general user. Its
length is currently defined as 64 bytes (63 usable characters plus
terminator) but should be referenced using the constant
NAMEDATALEN
in C
source code.
The length is set at compile time (and
is therefore adjustable for special uses); the default maximum
length might change in a future release. The type "char"
(note the quotes) is different from char(1)
in that it
only uses one byte of storage. It is internally used in the system
catalogs as a simplistic enumeration type.
Table 8.5. Special Character Types
Name | Storage Size | Description |
---|---|---|
"char" | 1 byte | single-byte internal type |
name | 64 bytes | internal type for object names |
The bytea
data type allows storage of binary strings;
see Table 8.6.
Table 8.6. Binary Data Types
Name | Storage Size | Description |
---|---|---|
bytea | 1 or 4 bytes plus the actual binary string | variable-length binary string |
A binary string is a sequence of octets (or bytes). Binary strings are distinguished from character strings in two ways. First, binary strings specifically allow storing octets of value zero and other “non-printable” octets (usually, octets outside the decimal range 32 to 126). Character strings disallow zero octets, and also disallow any other octet values and sequences of octet values that are invalid according to the database's selected character set encoding. Second, operations on binary strings process the actual bytes, whereas the processing of character strings depends on locale settings. In short, binary strings are appropriate for storing data that the programmer thinks of as “raw bytes”, whereas character strings are appropriate for storing text.
The bytea
type supports two
formats for input and output: “hex” format
and PostgreSQL's historical
“escape” format. Both
of these are always accepted on input. The output format depends
on the configuration parameter bytea_output;
the default is hex. (Note that the hex format was introduced in
PostgreSQL 9.0; earlier versions and some
tools don't understand it.)
The SQL standard defines a different binary
string type, called BLOB
or BINARY LARGE
OBJECT
. The input format is different from
bytea
, but the provided functions and operators are
mostly the same.
bytea Hex Format
The “hex” format encodes binary data as 2 hexadecimal digits
per byte, most significant nibble first. The entire string is
preceded by the sequence \x
(to distinguish it
from the escape format). In some contexts, the initial backslash may
need to be escaped by doubling it
(see Section 4.1.2.1).
For input, the hexadecimal digits can
be either upper or lower case, and whitespace is permitted between
digit pairs (but not within a digit pair nor in the starting
\x
sequence).
The hex format is compatible with a wide
range of external applications and protocols, and it tends to be
faster to convert than the escape format, so its use is preferred.
Example:
SET bytea_output = 'hex';

SELECT '\xDEADBEEF'::bytea;
   bytea
------------
 \xdeadbeef
bytea Escape Format
The “escape” format is the traditional
PostgreSQL format for the bytea
type. It
takes the approach of representing a binary string as a sequence
of ASCII characters, while converting those bytes that cannot be
represented as an ASCII character into special escape sequences.
If, from the point of view of the application, representing bytes
as characters makes sense, then this representation can be
convenient. But in practice it is usually confusing because it
fuzzes up the distinction between binary strings and character
strings, and also the particular escape mechanism that was chosen is
somewhat unwieldy. Therefore, this format should probably be avoided
for most new applications.
When entering bytea
values in escape format,
octets of certain
values must be escaped, while all octet
values can be escaped. In
general, to escape an octet, convert it into its three-digit
octal value and precede it by a backslash.
Backslash itself (octet decimal value 92) can alternatively be represented by
double backslashes.
Table 8.7
shows the characters that must be escaped, and gives the alternative
escape sequences where applicable.
Table 8.7. bytea Literal Escaped Octets
Decimal Octet Value | Description | Escaped Input Representation | Example | Hex Representation |
---|---|---|---|---|
0 | zero octet | '\000' | '\000'::bytea | \x00 |
39 | single quote | '''' or '\047' | ''''::bytea | \x27 |
92 | backslash | '\\' or '\134' | '\\'::bytea | \x5c |
0 to 31 and 127 to 255 | “non-printable” octets | '\xxx' (octal value) | '\001'::bytea | \x01 |
The requirement to escape non-printable octets varies depending on locale settings. In some instances you can get away with leaving them unescaped.
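For example, entering the escaped input representations from Table 8.7 produces the following values (shown with the default hex output format):
SELECT '\000'::bytea;    -- \x00
SELECT ''''::bytea;      -- \x27 (a single quote)
SELECT '\\'::bytea;      -- \x5c (a backslash)
SELECT '\001'::bytea;    -- \x01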
The reason that single quotes must be doubled, as shown
in Table 8.7, is that this
is true for any string literal in an SQL command. The generic
string-literal parser consumes the outermost single quotes
and reduces any pair of single quotes to one data character.
What the bytea
input function sees is just one
single quote, which it treats as a plain data character.
However, the bytea
input function treats
backslashes as special, and the other behaviors shown in
Table 8.7 are implemented by
that function.
In some contexts, backslashes must be doubled compared to what is shown above, because the generic string-literal parser will also reduce pairs of backslashes to one data character; see Section 4.1.2.1.
Bytea
octets are output in hex
format by default. If you change bytea_output
to escape
,
“non-printable” octets are converted to their
equivalent three-digit octal value and preceded by one backslash.
Most “printable” octets are output by their standard
representation in the client character set, e.g.:
SET bytea_output = 'escape';

SELECT 'abc \153\154\155 \052\251\124'::bytea;
     bytea
----------------
 abc klm *\251T
The octet with decimal value 92 (backslash) is doubled in the output. Details are in Table 8.8.
Table 8.8. bytea Output Escaped Octets
Decimal Octet Value | Description | Escaped Output Representation | Example | Output Result |
---|---|---|---|---|
92 | backslash | \\ | '\134'::bytea | \\ |
0 to 31 and 127 to 255 | “non-printable” octets | \xxx (octal value) | '\001'::bytea | \001 |
32 to 126 | “printable” octets | client character set representation | '\176'::bytea | ~ |
Depending on the front end to PostgreSQL you use,
you might have additional work to do in terms of escaping and
unescaping bytea
strings. For example, you might also
have to escape line feeds and carriage returns if your interface
automatically translates these.
PostgreSQL supports the full set of SQL date and time types, shown in Table 8.9. The operations available on these data types are described in Section 9.9. Dates are counted according to the Gregorian calendar, even in years before that calendar was introduced (see Section B.6 for more information).
Table 8.9. Date/Time Types
Name | Storage Size | Description | Low Value | High Value | Resolution |
---|---|---|---|---|---|
timestamp [ (p) ] [ without time zone ] | 8 bytes | both date and time (no time zone) | 4713 BC | 294276 AD | 1 microsecond |
timestamp [ (p) ] with time zone | 8 bytes | both date and time, with time zone | 4713 BC | 294276 AD | 1 microsecond |
date | 4 bytes | date (no time of day) | 4713 BC | 5874897 AD | 1 day |
time [ (p) ] [ without time zone ] | 8 bytes | time of day (no date) | 00:00:00 | 24:00:00 | 1 microsecond |
time [ (p) ] with time zone | 12 bytes | time of day (no date), with time zone | 00:00:00+1559 | 24:00:00-1559 | 1 microsecond |
interval [ fields ] [ (p) ] | 16 bytes | time interval | -178000000 years | 178000000 years | 1 microsecond |
The SQL standard requires that writing just timestamp
be equivalent to timestamp without time
zone
, and PostgreSQL honors that
behavior. timestamptz
is accepted as an
abbreviation for timestamp with time zone
; this is a
PostgreSQL extension.
time
, timestamp
, and
interval
accept an optional precision value
p
which specifies the number of
fractional digits retained in the seconds field. By default, there
is no explicit bound on precision. The allowed range of
p
is from 0 to 6.
The interval
type has an additional option, which is
to restrict the set of stored fields by writing one of these phrases:
YEAR
MONTH
DAY
HOUR
MINUTE
SECOND
YEAR TO MONTH
DAY TO HOUR
DAY TO MINUTE
DAY TO SECOND
HOUR TO MINUTE
HOUR TO SECOND
MINUTE TO SECOND
Note that if both fields
and
p
are specified, the
fields
must include SECOND
,
since the precision applies only to the seconds.
The type time with time zone
is defined by the SQL
standard, but the definition exhibits properties which lead to
questionable usefulness. In most cases, a combination of
date
, time
, timestamp without time
zone
, and timestamp with time zone
should
provide a complete range of date/time functionality required by
any application.
Date and time input is accepted in almost any reasonable format, including
ISO 8601, SQL-compatible,
traditional POSTGRES, and others.
For some formats, ordering of day, month, and year in date input is
ambiguous and there is support for specifying the expected
ordering of these fields. Set the DateStyle parameter
to MDY
to select month-day-year interpretation,
DMY
to select day-month-year interpretation, or
YMD
to select year-month-day interpretation.
PostgreSQL is more flexible in handling date/time input than the SQL standard requires. See Appendix B for the exact parsing rules of date/time input and for the recognized text fields including months, days of the week, and time zones.
Remember that any date or time literal input needs to be enclosed in single quotes, like text strings. Refer to Section 4.1.2.7 for more information. SQL requires the following syntax
type [ (p) ] 'value'
where p
is an optional precision
specification giving the number of
fractional digits in the seconds field. Precision can be
specified for time
, timestamp
, and
interval
types, and can range from 0 to 6.
If no precision is specified in a constant specification,
it defaults to the precision of the literal value (but not
more than 6 digits).
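For example (the results shown in the comments are illustrative):
SELECT TIME '04:05:06.789';                    -- 04:05:06.789, precision taken from the literal
SELECT TIME(1) '04:05:06.789';                 -- 04:05:06.8, rounded to one fractional digit
SELECT TIMESTAMP(0) '2004-10-19 10:23:54.9';   -- 2004-10-19 10:23:55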
Table 8.10 shows some possible
inputs for the date
type.
Table 8.10. Date Input
Example | Description |
---|---|
1999-01-08 | ISO 8601; January 8 in any mode (recommended format) |
January 8, 1999 | unambiguous in any datestyle input mode |
1/8/1999 | January 8 in MDY mode; August 1 in DMY mode |
1/18/1999 | January 18 in MDY mode; rejected in other modes |
01/02/03 | January 2, 2003 in MDY mode; February 1, 2003 in DMY mode; February 3, 2001 in YMD mode |
1999-Jan-08 | January 8 in any mode |
Jan-08-1999 | January 8 in any mode |
08-Jan-1999 | January 8 in any mode |
99-Jan-08 | January 8 in YMD mode, else error |
08-Jan-99 | January 8, except error in YMD mode |
Jan-08-99 | January 8, except error in YMD mode |
19990108 | ISO 8601; January 8, 1999 in any mode |
990108 | ISO 8601; January 8, 1999 in any mode |
1999.008 | year and day of year |
J2451187 | Julian date |
January 8, 99 BC | year 99 BC |
The time-of-day types are time [ (p) ] without time zone and
time [ (p) ] with time zone. time
alone is equivalent to
time without time zone.
Valid input for these types consists of a time of day followed
by an optional time zone. (See Table 8.11
and Table 8.12.) If a time zone is
specified in the input for time without time zone
,
it is silently ignored. You can also specify a date but it will
be ignored, except when you use a time zone name that involves a
daylight-savings rule, such as
America/New_York
. In this case specifying the date
is required in order to determine whether standard or daylight-savings
time applies. The appropriate time zone offset is recorded in the
time with time zone
value and is output as stored;
it is not adjusted to the active time zone.
Table 8.11. Time Input
Example | Description |
---|---|
04:05:06.789 | ISO 8601 |
04:05:06 | ISO 8601 |
04:05 | ISO 8601 |
040506 | ISO 8601 |
04:05 AM | same as 04:05; AM does not affect value |
04:05 PM | same as 16:05; input hour must be <= 12 |
04:05:06.789-8 | ISO 8601, with time zone as UTC offset |
04:05:06-08:00 | ISO 8601, with time zone as UTC offset |
04:05-08:00 | ISO 8601, with time zone as UTC offset |
040506-08 | ISO 8601, with time zone as UTC offset |
040506+0730 | ISO 8601, with fractional-hour time zone as UTC offset |
040506+07:30:00 | UTC offset specified to seconds (not allowed in ISO 8601) |
04:05:06 PST | time zone specified by abbreviation |
2003-04-12 04:05:06 America/New_York | time zone specified by full name |
Table 8.12. Time Zone Input
Example | Description |
---|---|
PST | Abbreviation (for Pacific Standard Time) |
America/New_York | Full time zone name |
PST8PDT | POSIX-style time zone specification |
-8:00:00 | UTC offset for PST |
-8:00 | UTC offset for PST (ISO 8601 extended format) |
-800 | UTC offset for PST (ISO 8601 basic format) |
-8 | UTC offset for PST (ISO 8601 basic format) |
zulu | Military abbreviation for UTC |
z | Short form of zulu (also in ISO 8601) |
Refer to Section 8.5.3 for more information on how to specify time zones.
Valid input for the time stamp types consists of the concatenation
of a date and a time, followed by an optional time zone,
followed by an optional AD
or BC
.
(Alternatively, AD
/BC
can appear
before the time zone, but this is not the preferred ordering.)
Thus:
1999-01-08 04:05:06
and:
1999-01-08 04:05:06 -8:00
are valid values, which follow the ISO 8601 standard. In addition, the common format:
January 8 04:05:06 1999 PST
is supported.
The SQL standard differentiates
timestamp without time zone
and timestamp with time zone
literals by the presence of a
“+” or “-” symbol and time zone offset after
the time. Hence, according to the standard,
TIMESTAMP '2004-10-19 10:23:54'
is a timestamp without time zone
, while
TIMESTAMP '2004-10-19 10:23:54+02'
is a timestamp with time zone
.
PostgreSQL never examines the content of a
literal string before determining its type, and therefore will treat
both of the above as timestamp without time zone
. To
ensure that a literal is treated as timestamp with time
zone
, give it the correct explicit type:
TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02'
In a literal that has been determined to be timestamp without time
zone
, PostgreSQL will silently ignore
any time zone indication.
That is, the resulting value is derived from the date/time
fields in the input value, and is not adjusted for time zone.
For timestamp with time zone
, the internally stored
value is always in UTC (Universal
Coordinated Time, traditionally known as Greenwich Mean Time,
GMT). An input value that has an explicit
time zone specified is converted to UTC using the appropriate offset
for that time zone. If no time zone is stated in the input string,
then it is assumed to be in the time zone indicated by the system's
TimeZone parameter, and is converted to UTC using the
offset for the timezone
zone.
When a timestamp with time
zone
value is output, it is always converted from UTC to the
current timezone
zone, and displayed as local time in that
zone. To see the time in another time zone, either change
timezone
or use the AT TIME ZONE
construct
(see Section 9.9.4).
Conversions between timestamp without time zone
and
timestamp with time zone
normally assume that the
timestamp without time zone
value should be taken or given
as timezone
local time. A different time zone can
be specified for the conversion using AT TIME ZONE
.
PostgreSQL supports several
special date/time input values for convenience, as shown in Table 8.13. The values
infinity
and -infinity
are specially represented inside the system and will be displayed
unchanged; but the others are simply notational shorthands
that will be converted to ordinary date/time values when read.
(In particular, now
and related strings are converted
to a specific time value as soon as they are read.)
All of these values need to be enclosed in single quotes when used
as constants in SQL commands.
Table 8.13. Special Date/Time Inputs
Input String | Valid Types | Description |
---|---|---|
epoch | date , timestamp | 1970-01-01 00:00:00+00 (Unix system time zero) |
infinity | date , timestamp | later than all other time stamps |
-infinity | date , timestamp | earlier than all other time stamps |
now | date , time , timestamp | current transaction's start time |
today | date , timestamp | midnight (00:00 ) today |
tomorrow | date , timestamp | midnight (00:00 ) tomorrow |
yesterday | date , timestamp | midnight (00:00 ) yesterday |
allballs | time | 00:00:00.00 UTC |
The following SQL-compatible functions can also
be used to obtain the current time value for the corresponding data
type:
CURRENT_DATE
, CURRENT_TIME
,
CURRENT_TIMESTAMP
, LOCALTIME
,
LOCALTIMESTAMP
. (See Section 9.9.5.) Note that these are
SQL functions and are not recognized in data input strings.
While the input strings now
,
today
, tomorrow
,
and yesterday
are fine to use in interactive SQL
commands, they can have surprising behavior when the command is
saved to be executed later, for example in prepared statements,
views, and function definitions. The string can be converted to a
specific time value that continues to be used long after it becomes
stale. Use one of the SQL functions instead in such contexts.
For example, CURRENT_DATE + 1
is safer than
'tomorrow'::date
.
The output format of the date/time types can be set to one of the four
styles ISO 8601,
SQL (Ingres), traditional POSTGRES
(Unix date format), or
German. The default
is the ISO format. (The
SQL standard requires the use of the ISO 8601
format. The name of the “SQL” output format is a
historical accident.) Table 8.14 shows examples of each
output style. The output of the date
and
time
types is generally only the date or time part
in accordance with the given examples. However, the
POSTGRES style outputs date-only values in
ISO format.
Table 8.14. Date/Time Output Styles
Style Specification | Description | Example |
---|---|---|
ISO | ISO 8601, SQL standard | 1997-12-17 07:37:16-08 |
SQL | traditional style | 12/17/1997 07:37:16.00 PST |
Postgres | original style | Wed Dec 17 07:37:16 1997 PST |
German | regional style | 17.12.1997 07:37:16.00 PST |
ISO 8601 specifies the use of uppercase letter T
to separate
the date and time. PostgreSQL accepts that format on
input, but on output it uses a space rather than T
, as shown
above. This is for readability and for consistency with
RFC 3339 as
well as some other database systems.
In the SQL and POSTGRES styles, day appears before month if DMY field ordering has been specified, otherwise month appears before day. (See Section 8.5.1 for how this setting also affects interpretation of input values.) Table 8.15 shows examples.
Table 8.15. Date Order Conventions
datestyle Setting | Input Ordering | Example Output |
---|---|---|
SQL, DMY | day/month/year | 17/12/1997 15:37:16.00 CET |
SQL, MDY | month/day/year | 12/17/1997 07:37:16.00 PST |
Postgres, DMY | day/month/year | Wed 17 Dec 07:37:16 1997 PST |
In the ISO style, the time zone is always shown as
a signed numeric offset from UTC, with positive sign used for zones
east of Greenwich. The offset will be shown
as hh (hours only) if it is an integral
number of hours, else as hh:mm if it
is an integral number of minutes, else as
hh:mm:ss.
(The third case is not possible with any modern time zone standard,
but it can appear when working with timestamps that predate the
adoption of standardized time zones.)
In the other date styles, the time zone is shown as an alphabetic
abbreviation if one is in common use in the current zone. Otherwise
it appears as a signed numeric offset in ISO 8601 basic format
(hh or hhmm).
The date/time style can be selected by the user using the
SET datestyle
command, the DateStyle parameter in the
postgresql.conf
configuration file, or the
PGDATESTYLE
environment variable on the server or
client.
The formatting function to_char
(see Section 9.8) is also available as
a more flexible way to format date/time output.
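For example (the first result depends on the current time; the output shown is illustrative):
SELECT to_char(now(), 'YYYY-MM-DD HH24:MI:SS');        -- e.g., 2004-10-19 10:23:54
SELECT to_char(date '2004-10-19', 'Dy, DD Mon YYYY');  -- Tue, 19 Oct 2004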
Time zones, and time-zone conventions, are influenced by political decisions, not just earth geometry. Time zones around the world became somewhat standardized during the 1900s, but continue to be prone to arbitrary changes, particularly with respect to daylight-savings rules. PostgreSQL uses the widely-used IANA (Olson) time zone database for information about historical time zone rules. For times in the future, the assumption is that the latest known rules for a given time zone will continue to be observed indefinitely far into the future.
PostgreSQL endeavors to be compatible with the SQL standard definitions for typical usage. However, the SQL standard has an odd mix of date and time types and capabilities. Two obvious problems are:
Although the date
type
cannot have an associated time zone, the
time
type can.
Time zones in the real world have little meaning unless
associated with a date as well as a time,
since the offset can vary through the year with daylight-saving
time boundaries.
The default time zone is specified as a constant numeric offset from UTC. It is therefore impossible to adapt to daylight-saving time when doing date/time arithmetic across DST boundaries.
To address these difficulties, we recommend using date/time types
that contain both date and time when using time zones. We
do not recommend using the type time with
time zone
(though it is supported by
PostgreSQL for legacy applications and
for compliance with the SQL standard).
PostgreSQL assumes
your local time zone for any type containing only date or time.
All timezone-aware dates and times are stored internally in UTC. They are converted to local time in the zone specified by the TimeZone configuration parameter before being displayed to the client.
PostgreSQL allows you to specify time zones in three different forms:
A full time zone name, for example America/New_York
.
The recognized time zone names are listed in the
pg_timezone_names
view (see Section 52.94).
PostgreSQL uses the widely-used IANA
time zone data for this purpose, so the same time zone
names are also recognized by other software.
A time zone abbreviation, for example PST
. Such a
specification merely defines a particular offset from UTC, in
contrast to full time zone names which can imply a set of daylight
savings transition rules as well. The recognized abbreviations
are listed in the pg_timezone_abbrevs
view (see Section 52.93). You cannot set the
configuration parameters TimeZone or
log_timezone to a time
zone abbreviation, but you can use abbreviations in
date/time input values and with the AT TIME ZONE
operator.
In addition to the timezone names and abbreviations, PostgreSQL will accept POSIX-style time zone specifications, as described in Section B.5. This option is not normally preferable to using a named time zone, but it may be necessary if no suitable IANA time zone entry is available.
In short, this is the difference between abbreviations
and full names: abbreviations represent a specific offset from UTC,
whereas many of the full names imply a local daylight-savings time
rule, and so have two possible UTC offsets. As an example,
2014-06-04 12:00 America/New_York
represents noon local
time in New York, which for this particular date was Eastern Daylight
Time (UTC-4). So 2014-06-04 12:00 EDT
specifies that
same time instant. But 2014-06-04 12:00 EST
specifies
noon Eastern Standard Time (UTC-5), regardless of whether daylight
savings was nominally in effect on that date.
To complicate matters, some jurisdictions have used the same timezone
abbreviation to mean different UTC offsets at different times; for
example, in Moscow MSK
has meant UTC+3 in some years and
UTC+4 in others. PostgreSQL interprets such
abbreviations according to whatever they meant (or had most recently
meant) on the specified date; but, as with the EST
example
above, this is not necessarily the same as local civil time on that date.
In all cases, timezone names and abbreviations are recognized case-insensitively. (This is a change from PostgreSQL versions prior to 8.2, which were case-sensitive in some contexts but not others.)
Neither timezone names nor abbreviations are hard-wired into the server;
they are obtained from configuration files stored under
.../share/timezone/
and .../share/timezonesets/
of the installation directory
(see Section B.4).
The TimeZone configuration parameter can
be set in the file postgresql.conf
, or in any of the
other standard ways described in Chapter 20.
There are also some special ways to set it:
The SQL command SET TIME ZONE
sets the time zone for the session. This is an alternative spelling
of SET TIMEZONE TO
with a more SQL-spec-compatible syntax.
The PGTZ
environment variable is used by
libpq clients
to send a SET TIME ZONE
command to the server upon connection.
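For example, within a session:
SET TIME ZONE 'America/New_York';   -- full time zone name
SET timezone TO 'Asia/Tokyo';       -- equivalent spelling using the parameter name
SHOW timezone;                      -- displays the current setting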
interval
values can be written using the following
verbose syntax:
[@] quantity unit [quantity unit ...] [direction]
where quantity
is a number (possibly signed);
unit
is microsecond
,
millisecond
, second
,
minute
, hour
, day
,
week
, month
, year
,
decade
, century
, millennium
,
or abbreviations or plurals of these units;
direction
can be ago
or
empty. The at sign (@
) is optional noise. The amounts
of the different units are implicitly added with appropriate
sign accounting. ago
negates all the fields.
This syntax is also used for interval output, if
IntervalStyle is set to
postgres_verbose
.
Quantities of days, hours, minutes, and seconds can be specified without
explicit unit markings. For example, '1 12:59:10'
is read
the same as '1 day 12 hours 59 min 10 sec'
. Also,
a combination of years and months can be specified with a dash;
for example '200-10'
is read the same as '200 years
10 months'
. (These shorter forms are in fact the only ones allowed
by the SQL standard, and are used for output when
IntervalStyle
is set to sql_standard
.)
Interval values can also be written as ISO 8601 time intervals, using either the “format with designators” of the standard's section 4.4.3.2 or the “alternative format” of section 4.4.3.3. The format with designators looks like this:
P quantity unit [ quantity unit ...] [ T [ quantity unit ...]]
The string must start with a P
, and may include a
T
that introduces the time-of-day units. The
available unit abbreviations are given in Table 8.16. Units may be
omitted, and may be specified in any order, but units smaller than
a day must appear after T
. In particular, the meaning of
M
depends on whether it is before or after
T
.
Table 8.16. ISO 8601 Interval Unit Abbreviations
Abbreviation | Meaning |
---|---|
Y | Years |
M | Months (in the date part) |
W | Weeks |
D | Days |
H | Hours |
M | Minutes (in the time part) |
S | Seconds |
In the alternative format:
P [ years-months-days ] [ T hours:minutes:seconds ]
the string must begin with P
, and a
T
separates the date and time parts of the interval.
The values are given as numbers similar to ISO 8601 dates.
When writing an interval constant with a fields
specification, or when assigning a string to an interval column that was
defined with a fields
specification, the interpretation of
unmarked quantities depends on the fields
. For
example INTERVAL '1' YEAR
is read as 1 year, whereas
INTERVAL '1'
means 1 second. Also, field values
“to the right” of the least significant field allowed by the
fields
specification are silently discarded. For
example, writing INTERVAL '1 day 2:03:04' HOUR TO MINUTE
results in dropping the seconds field, but not the day field.
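These rules can be seen directly (the results shown in the comments are illustrative, using the default postgres output style):
SELECT INTERVAL '1' YEAR;                          -- 1 year
SELECT INTERVAL '1';                               -- 00:00:01 (one second)
SELECT INTERVAL '1 day 2:03:04' HOUR TO MINUTE;    -- 1 day 02:03:00, seconds discarded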
According to the SQL standard all fields of an interval
value must have the same sign, so a leading negative sign applies to all
fields; for example the negative sign in the interval literal
'-1 2:03:04'
applies to both the days and hour/minute/second
parts. PostgreSQL allows the fields to have different
signs, and traditionally treats each field in the textual representation
as independently signed, so that the hour/minute/second part is
considered positive in this example. If IntervalStyle
is
set to sql_standard
then a leading sign is considered
to apply to all fields (but only if no additional signs appear).
Otherwise the traditional PostgreSQL interpretation is
used. To avoid ambiguity, it's recommended to attach an explicit sign
to each field if any field is negative.
Internally, interval
values are stored as three integral
fields: months, days, and microseconds. These fields are kept
separate because the number of days in a month varies, while a day
can have 23 or 25 hours if a daylight savings time transition is
involved. An interval input string that uses other units is
normalized into this format, and then reconstructed in a standardized
way for output, for example:
SELECT '2 years 15 months 100 weeks 99 hours 123456789 milliseconds'::interval;

               interval
---------------------------------------
 3 years 3 mons 700 days 133:17:36.789
Here weeks, which are understood as “7 days”, have been kept separate, while the smaller and larger time units were combined and normalized.
Input field values can have fractional parts, for example '1.5
weeks'
or '01:02:03.45'
. However,
because interval
internally stores only integral fields,
fractional values must be converted into smaller
units. Fractional parts of units greater than months are truncated to
be an integer number of months, e.g. '1.5 years'
becomes '1 year 6 mons'
. Fractional parts of
weeks and days are computed to be an integer number of days and
microseconds, assuming 30 days per month and 24 hours per day, e.g.,
'1.75 months'
becomes 1 mon 22 days
12:00:00
. Only seconds will ever be shown as fractional
on output.
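For example:
SELECT '1.5 years'::interval;     -- 1 year 6 mons
SELECT '1.5 weeks'::interval;     -- 10 days 12:00:00
SELECT '1.75 months'::interval;   -- 1 mon 22 days 12:00:00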
Table 8.17 shows some examples
of valid interval
input.
Table 8.17. Interval Input
Example | Description |
---|---|
1-2 | SQL standard format: 1 year 2 months |
3 4:05:06 | SQL standard format: 3 days 4 hours 5 minutes 6 seconds |
1 year 2 months 3 days 4 hours 5 minutes 6 seconds | Traditional Postgres format: 1 year 2 months 3 days 4 hours 5 minutes 6 seconds |
P1Y2M3DT4H5M6S | ISO 8601 “format with designators”: same meaning as above |
P0001-02-03T04:05:06 | ISO 8601 “alternative format”: same meaning as above |
As previously explained, PostgreSQL
stores interval
values as months, days, and
microseconds. For output, the months field is converted to years and
months by dividing by 12. The days field is shown as-is. The
microseconds field is converted to hours, minutes, seconds, and
fractional seconds. Thus months, minutes, and seconds will never be
shown as exceeding the ranges 0–11, 0–59, and 0–59
respectively, while the displayed years, days, and hours fields can
be quite large. (The justify_days
and justify_hours
functions can be used if it is desirable to transpose large days or
hours values into the next higher field.)
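For example:
SELECT justify_hours(interval '27 hours');   -- 1 day 03:00:00
SELECT justify_days(interval '35 days');     -- 1 mon 5 days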
The output format of the interval type can be set to one of the
four styles sql_standard
, postgres
,
postgres_verbose
, or iso_8601
,
using the command SET intervalstyle
.
The default is the postgres
format.
Table 8.18 shows examples of each
output style.
The sql_standard
style produces output that conforms to
the SQL standard's specification for interval literal strings, if
the interval value meets the standard's restrictions (either year-month
only or day-time only, with no mixing of positive
and negative components). Otherwise the output looks like a standard
year-month literal string followed by a day-time literal string,
with explicit signs added to disambiguate mixed-sign intervals.
The output of the postgres
style matches the output of
PostgreSQL releases prior to 8.4 when the
DateStyle parameter was set to ISO
.
The output of the postgres_verbose
style matches the output of
PostgreSQL releases prior to 8.4 when the
DateStyle
parameter was set to non-ISO
output.
The output of the iso_8601
style matches the “format
with designators” described in section 4.4.3.2 of the
ISO 8601 standard.
Table 8.18. Interval Output Style Examples
Style Specification | Year-Month Interval | Day-Time Interval | Mixed Interval |
---|---|---|---|
sql_standard | 1-2 | 3 4:05:06 | -1-2 +3 -4:05:06 |
postgres | 1 year 2 mons | 3 days 04:05:06 | -1 year -2 mons +3 days -04:05:06 |
postgres_verbose | @ 1 year 2 mons | @ 3 days 4 hours 5 mins 6 secs | @ 1 year 2 mons -3 days 4 hours 5 mins 6 secs ago |
iso_8601 | P1Y2M | P3DT4H5M6S | P-1Y-2M3DT-4H-5M-6S |
PostgreSQL provides the
standard SQL type boolean
;
see Table 8.19.
The boolean
type can have several states:
“true”, “false”, and a third state,
“unknown”, which is represented by the
SQL null value.
Table 8.19. Boolean Data Type
Name | Storage Size | Description |
---|---|---|
boolean | 1 byte | state of true or false |
Boolean constants can be represented in SQL queries by the SQL
key words TRUE
, FALSE
,
and NULL
.
The datatype input function for type boolean
accepts these
string representations for the “true” state:
true |
yes |
on |
1 |
and these representations for the “false” state:
false |
no |
off |
0 |
Unique prefixes of these strings are also accepted, for
example t
or n
.
Leading or trailing whitespace is ignored, and case does not matter.
The datatype output function for type boolean
always emits
either t
or f
, as shown in
Example 8.2.
Example 8.2. Using the boolean Type
CREATE TABLE test1 (a boolean, b text);
INSERT INTO test1 VALUES (TRUE, 'sic est');
INSERT INTO test1 VALUES (FALSE, 'non est');
SELECT * FROM test1;
 a |    b
---+---------
 t | sic est
 f | non est

SELECT * FROM test1 WHERE a;
 a |    b
---+---------
 t | sic est
The key words TRUE
and FALSE
are
the preferred (SQL-compliant) method for writing
Boolean constants in SQL queries. But you can also use the string
representations by following the generic string-literal constant syntax
described in Section 4.1.2.7, for
example 'yes'::boolean
.
Note that the parser automatically understands
that TRUE
and FALSE
are of
type boolean
, but this is not so
for NULL
because that can have any type.
So in some contexts you might have to cast NULL
to boolean
explicitly, for
example NULL::boolean
. Conversely, the cast can be
omitted from a string-literal Boolean value in contexts where the parser
can deduce that the literal must be of type boolean
.
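For instance, a small illustration of these casts (output shown as it would typically appear):

SELECT 'yes'::boolean AS yes, 'off'::boolean AS off, NULL::boolean AS unknown;
 yes | off | unknown
-----+-----+---------
 t   | f   |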
Enumerated (enum) types are data types that
comprise a static, ordered set of values.
They are equivalent to the enum
types supported in a number of programming languages. An example of an enum
type might be the days of the week, or a set of status values for
a piece of data.
Enum types are created using the CREATE TYPE command, for example:
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
Once created, the enum type can be used in table and function definitions much like any other type:
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
    name text,
    current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
 name | current_mood
------+--------------
 Moe  | happy
(1 row)
The ordering of the values in an enum type is the order in which the values were listed when the type was created. All standard comparison operators and related aggregate functions are supported for enums. For example:
INSERT INTO person VALUES ('Larry', 'sad');
INSERT INTO person VALUES ('Curly', 'ok');
SELECT * FROM person WHERE current_mood > 'sad';
 name  | current_mood
-------+--------------
 Moe   | happy
 Curly | ok
(2 rows)

SELECT * FROM person WHERE current_mood > 'sad' ORDER BY current_mood;
 name  | current_mood
-------+--------------
 Curly | ok
 Moe   | happy
(2 rows)

SELECT name FROM person WHERE current_mood = (SELECT MIN(current_mood) FROM person);
 name
-------
 Larry
(1 row)
Each enumerated data type is separate and cannot be compared with other enumerated types. See this example:
CREATE TYPE happiness AS ENUM ('happy', 'very happy', 'ecstatic');
CREATE TABLE holidays (
    num_weeks integer,
    happiness happiness
);
INSERT INTO holidays(num_weeks,happiness) VALUES (4, 'happy');
INSERT INTO holidays(num_weeks,happiness) VALUES (6, 'very happy');
INSERT INTO holidays(num_weeks,happiness) VALUES (8, 'ecstatic');
INSERT INTO holidays(num_weeks,happiness) VALUES (2, 'sad');
ERROR:  invalid input value for enum happiness: "sad"
SELECT person.name, holidays.num_weeks FROM person, holidays
  WHERE person.current_mood = holidays.happiness;
ERROR:  operator does not exist: mood = happiness
If you really need to do something like that, you can either write a custom operator or add explicit casts to your query:
SELECT person.name, holidays.num_weeks FROM person, holidays
  WHERE person.current_mood::text = holidays.happiness::text;
 name | num_weeks
------+-----------
 Moe  | 4
(1 row)
Enum labels are case sensitive, so
'happy'
is not the same as 'HAPPY'
.
White space in the labels is significant too.
Although enum types are primarily intended for static sets of values, there is support for adding new values to an existing enum type, and for renaming values (see ALTER TYPE). Existing values cannot be removed from an enum type, nor can the sort ordering of such values be changed, short of dropping and re-creating the enum type.
An enum value occupies four bytes on disk. The length of an enum
value's textual label is limited by the NAMEDATALEN
setting compiled into PostgreSQL; in standard
builds this means at most 63 bytes.
The translations from internal enum values to textual labels are
kept in the system catalog
pg_enum
.
Querying this catalog directly can be useful.
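For example, assuming the mood type created above, a query along the following lines lists an enum type's labels in their defined sort order (a minimal sketch):

SELECT enumlabel, enumsortorder
FROM pg_enum
WHERE enumtypid = 'mood'::regtype
ORDER BY enumsortorder;
 enumlabel | enumsortorder
-----------+---------------
 sad       |             1
 ok        |             2
 happy     |             3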
Geometric data types represent two-dimensional spatial objects. Table 8.20 shows the geometric types available in PostgreSQL.
Table 8.20. Geometric Types
Name | Storage Size | Description | Representation |
---|---|---|---|
point | 16 bytes | Point on a plane | (x,y) |
line | 24 bytes | Infinite line | {A,B,C} |
lseg | 32 bytes | Finite line segment | ((x1,y1),(x2,y2)) |
box | 32 bytes | Rectangular box | ((x1,y1),(x2,y2)) |
path | 16+16n bytes | Closed path (similar to polygon) | ((x1,y1),...) |
path | 16+16n bytes | Open path | [(x1,y1),...] |
polygon | 40+16n bytes | Polygon (similar to closed path) | ((x1,y1),...) |
circle | 24 bytes | Circle | <(x,y),r> (center point and radius) |
In all these types, the individual coordinates are stored as
double precision
(float8
) numbers.
A rich set of functions and operators is available to perform various geometric operations such as scaling, translation, rotation, and determining intersections. They are explained in Section 9.11.
Points are the fundamental two-dimensional building block for geometric
types. Values of type point
are specified using either of
the following syntaxes:
( x , y )
  x , y

where x and y are the respective coordinates, as floating-point numbers.
Points are output using the first syntax.
Lines are represented by the linear equation Ax + By + C = 0, where A and B are not both zero. Values of type line are input and output in the following form:

{ A, B, C }

Alternatively, any of the following forms can be used for input:

[ ( x1 , y1 ) , ( x2 , y2 ) ]
( ( x1 , y1 ) , ( x2 , y2 ) )
  ( x1 , y1 ) , ( x2 , y2 )
    x1 , y1   ,   x2 , y2

where (x1,y1) and (x2,y2) are two different points on the line.
Line segments are represented by pairs of points that are the endpoints
of the segment. Values of type lseg
are specified using any
of the following syntaxes:
[ ( x1 , y1 ) , ( x2 , y2 ) ]
( ( x1 , y1 ) , ( x2 , y2 ) )
  ( x1 , y1 ) , ( x2 , y2 )
    x1 , y1   ,   x2 , y2

where (x1,y1) and (x2,y2) are the end points of the line segment.
Line segments are output using the first syntax.
Boxes are represented by pairs of points that are opposite
corners of the box.
Values of type box
are specified using any of the following
syntaxes:
( ( x1 , y1 ) , ( x2 , y2 ) )
  ( x1 , y1 ) , ( x2 , y2 )
    x1 , y1   ,   x2 , y2

where (x1,y1) and (x2,y2) are any two opposite corners of the box.
Boxes are output using the second syntax.
Any two opposite corners can be supplied on input, but the values will be reordered as needed to store the upper right and lower left corners, in that order.
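For example, supplying the lower left corner first still yields the canonical ordering on output (an illustrative query, with output shown as it would typically appear):

SELECT box '((1,1),(4,3))', box '((4,3),(1,1))';
     box     |     box
-------------+-------------
 (4,3),(1,1) | (4,3),(1,1)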
Paths are represented by lists of connected points. Paths can be open, where the first and last points in the list are considered not connected, or closed, where the first and last points are considered connected.
Values of type path
are specified using any of the following
syntaxes:
[ ( x1 , y1 ) , ... , ( xn , yn ) ]
( ( x1 , y1 ) , ... , ( xn , yn ) )
  ( x1 , y1 ) , ... , ( xn , yn )
  ( x1 , y1   , ... ,   xn , yn )
    x1 , y1   , ... ,   xn , yn
where the points are the end points of the line segments
comprising the path. Square brackets ([]
) indicate
an open path, while parentheses (()
) indicate a
closed path. When the outermost parentheses are omitted, as
in the third through fifth syntaxes, a closed path is assumed.
Paths are output using the first or second syntax, as appropriate.
Polygons are represented by lists of points (the vertexes of the polygon). Polygons are very similar to closed paths; the essential semantic difference is that a polygon is considered to include the area within it, while a path is not.
An important implementation difference between polygons and paths is that the stored representation of a polygon includes its smallest bounding box. This speeds up certain search operations, although computing the bounding box adds overhead while constructing new polygons.
Values of type polygon
are specified using any of the
following syntaxes:
( ( x1 , y1 ) , ... , ( xn , yn ) )
  ( x1 , y1 ) , ... , ( xn , yn )
  ( x1 , y1   , ... ,   xn , yn )
    x1 , y1   , ... ,   xn , yn
where the points are the end points of the line segments comprising the boundary of the polygon.
Polygons are output using the first syntax.
Circles are represented by a center point and radius.
Values of type circle
are specified using any of the
following syntaxes:
< ( x , y ) , r >
( ( x , y ) , r )
  ( x , y ) , r
    x , y   , r

where (x,y) is the center point and r is the radius of the circle.
Circles are output using the first syntax.
PostgreSQL offers data types to store IPv4, IPv6, and MAC addresses, as shown in Table 8.21. It is better to use these types instead of plain text types to store network addresses, because these types offer input error checking and specialized operators and functions (see Section 9.12).
Table 8.21. Network Address Types
Name | Storage Size | Description |
---|---|---|
cidr | 7 or 19 bytes | IPv4 and IPv6 networks |
inet | 7 or 19 bytes | IPv4 and IPv6 hosts and networks |
macaddr | 6 bytes | MAC addresses |
macaddr8 | 8 bytes | MAC addresses (EUI-64 format) |
When sorting inet
or cidr
data types,
IPv4 addresses will always sort before IPv6 addresses, including
IPv4 addresses encapsulated or mapped to IPv6 addresses, such as
::10.2.3.4 or ::ffff:10.4.3.2.
inet
The inet
type holds an IPv4 or IPv6 host address, and
optionally its subnet, all in one field.
The subnet is represented by the number of network address bits
present in the host address (the
“netmask”). If the netmask is 32 and the address is IPv4,
then the value does not indicate a subnet, only a single host.
In IPv6, the address length is 128 bits, so 128 bits specify a
unique host address. Note that if you
want to accept only networks, you should use the
cidr
type rather than inet
.
The input format for this type is
address/y
where
address
is an IPv4 or IPv6 address and
y
is the number of bits in the netmask. If the
/y
portion is omitted, the
netmask is taken to be 32 for IPv4 or 128 for IPv6,
so the value represents
just a single host. On display, the
/y
portion is suppressed if the netmask specifies a single host.
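For example, a host address with a full-length netmask is displayed without the /y suffix, while a subnet specification keeps it (a brief illustration; output shown as it would typically appear):

SELECT '192.168.0.1/32'::inet AS host_only, '192.168.0.1/24'::inet AS with_subnet;
  host_only  |  with_subnet
-------------+----------------
 192.168.0.1 | 192.168.0.1/24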
cidr
The cidr
type holds an IPv4 or IPv6 network specification.
Input and output formats follow Classless Internet Domain Routing
conventions.
The format for specifying networks is address/y
where address
is the network's lowest
address represented as an
IPv4 or IPv6 address, and y
is the number of bits in the netmask. If
y
is omitted, it is calculated
using assumptions from the older classful network numbering system, except
it will be at least large enough to include all of the octets
written in the input. It is an error to specify a network address
that has bits set to the right of the specified netmask.
Table 8.22 shows some examples.
Table 8.22. cidr Type Input Examples
cidr Input | cidr Output | abbrev(cidr) |
---|---|---|
192.168.100.128/25 | 192.168.100.128/25 | 192.168.100.128/25 |
192.168/24 | 192.168.0.0/24 | 192.168.0/24 |
192.168/25 | 192.168.0.0/25 | 192.168.0.0/25 |
192.168.1 | 192.168.1.0/24 | 192.168.1/24 |
192.168 | 192.168.0.0/24 | 192.168.0/24 |
128.1 | 128.1.0.0/16 | 128.1/16 |
128 | 128.0.0.0/16 | 128.0/16 |
128.1.2 | 128.1.2.0/24 | 128.1.2/24 |
10.1.2 | 10.1.2.0/24 | 10.1.2/24 |
10.1 | 10.1.0.0/16 | 10.1/16 |
10 | 10.0.0.0/8 | 10/8 |
10.1.2.3/32 | 10.1.2.3/32 | 10.1.2.3/32 |
2001:4f8:3:ba::/64 | 2001:4f8:3:ba::/64 | 2001:4f8:3:ba/64 |
2001:4f8:3:ba:2e0:81ff:fe22:d1f1/128 | 2001:4f8:3:ba:2e0:81ff:fe22:d1f1/128 | 2001:4f8:3:ba:2e0:81ff:fe22:d1f1/128 |
::ffff:1.2.3.0/120 | ::ffff:1.2.3.0/120 | ::ffff:1.2.3/120 |
::ffff:1.2.3.0/128 | ::ffff:1.2.3.0/128 | ::ffff:1.2.3.0/128 |
inet vs. cidr
The essential difference between inet
and cidr
data types is that inet
accepts values with nonzero bits to
the right of the netmask, whereas cidr
does not. For
example, 192.168.0.1/24
is valid for inet
but not for cidr
.
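For instance, the same value is accepted by inet but rejected by cidr (a minimal illustration; the exact error text may vary by version):

SELECT '192.168.0.1/24'::inet;   -- accepted
SELECT '192.168.0.1/24'::cidr;   -- fails
ERROR:  invalid cidr value: "192.168.0.1/24"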
If you do not like the output format for inet
or
cidr
values, try the functions host
,
text
, and abbrev
.
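For example, a short sketch of these functions' effects (output shown as it would typically appear):

SELECT host(inet '192.168.0.1/24') AS host,
       text(inet '192.168.0.1/24') AS text,
       abbrev(cidr '10.1.0.0/16')  AS abbrev;
    host     |      text      | abbrev
-------------+----------------+---------
 192.168.0.1 | 192.168.0.1/24 | 10.1/16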
macaddr
The macaddr
type stores MAC addresses, known for example
from Ethernet card hardware addresses (although MAC addresses are
used for other purposes as well). Input is accepted in the
following formats:
'08:00:2b:01:02:03' |
'08-00-2b-01-02-03' |
'08002b:010203' |
'08002b-010203' |
'0800.2b01.0203' |
'0800-2b01-0203' |
'08002b010203' |
These examples all specify the same address. Upper and
lower case is accepted for the digits
a
through f
. Output is always in the
first of the forms shown.
IEEE Standard 802-2001 specifies the second form shown (with hyphens) as the canonical form for MAC addresses, and specifies the first form (with colons) as used with bit-reversed, MSB-first notation, so that 08-00-2b-01-02-03 = 10:00:D4:80:40:C0. This convention is widely ignored nowadays, and it is relevant only for obsolete network protocols (such as Token Ring). PostgreSQL makes no provisions for bit reversal; all accepted formats use the canonical LSB order.
The remaining five input formats are not part of any standard.
macaddr8
The macaddr8
type stores MAC addresses in EUI-64
format, known for example from Ethernet card hardware addresses
(although MAC addresses are used for other purposes as well).
This type can accept both 6 and 8 byte length MAC addresses
and stores them in 8 byte length format. MAC addresses given
in 6 byte format will be stored in 8 byte length format with the
4th and 5th bytes set to FF and FE, respectively.
Note that IPv6 uses a modified EUI-64 format where the 7th bit
should be set to one after the conversion from EUI-48. The
function macaddr8_set7bit
is provided to make this
change.
Generally speaking, any input which is comprised of pairs of hex
digits (on byte boundaries), optionally separated consistently by
one of ':'
, '-'
or '.'
, is
accepted. The number of hex digits must be either 16 (8 bytes) or
12 (6 bytes). Leading and trailing whitespace is ignored.
The following are examples of input formats that are accepted:
'08:00:2b:01:02:03:04:05' |
'08-00-2b-01-02-03-04-05' |
'08002b:0102030405' |
'08002b-0102030405' |
'0800.2b01.0203.0405' |
'0800-2b01-0203-0405' |
'08002b01:02030405' |
'08002b0102030405' |
These examples all specify the same address. Upper and
lower case is accepted for the digits
a
through f
. Output is always in the
first of the forms shown.
The last six input formats shown above are not part of any standard.
To convert a traditional 48 bit MAC address in EUI-48 format to
modified EUI-64 format to be included as the host portion of an
IPv6 address, use macaddr8_set7bit
as shown:
SELECT macaddr8_set7bit('08:00:2b:01:02:03');
macaddr8_set7bit
-------------------------
0a:00:2b:ff:fe:01:02:03
(1 row)
Bit strings are strings of 1's and 0's. They can be used to store
or visualize bit masks. There are two SQL bit types:
bit(n) and bit varying(n), where n is a positive integer.
bit
type data must match the length
n
exactly; it is an error to attempt to
store shorter or longer bit strings. bit varying
data is
of variable length up to the maximum length
n
; longer strings will be rejected.
Writing bit
without a length is equivalent to
bit(1)
, while bit varying
without a length
specification means unlimited length.
If one explicitly casts a bit-string value to bit(n), it will be truncated or zero-padded on the right to be exactly n bits, without raising an error. Similarly, if one explicitly casts a bit-string value to bit varying(n), it will be truncated on the right if it is more than n bits.
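For instance, explicit casts pad or truncate on the right as just described (an illustrative query, with output shown as it would typically appear):

SELECT B'101'::bit(5)            AS padded,
       B'10101'::bit(3)          AS truncated,
       B'10101'::bit varying(3)  AS vtruncated;
 padded | truncated | vtruncated
--------+-----------+------------
 10100  | 101       | 101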
Refer to Section 4.1.2.5 for information about the syntax of bit string constants. Bit-logical operators and string manipulation functions are available; see Section 9.6.
Example 8.3. Using the Bit String Types
CREATE TABLE test (a BIT(3), b BIT VARYING(5));
INSERT INTO test VALUES (B'101', B'00');
INSERT INTO test VALUES (B'10', B'101');
ERROR:  bit string length 2 does not match type bit(3)
INSERT INTO test VALUES (B'10'::bit(3), B'101');
SELECT * FROM test;
  a  |  b
-----+-----
 101 | 00
 100 | 101
A bit string value requires 1 byte for each group of 8 bits, plus 5 or 8 bytes overhead depending on the length of the string (but long values may be compressed or moved out-of-line, as explained in Section 8.3 for character strings).
PostgreSQL provides two data types that
are designed to support full text search, which is the activity of
searching through a collection of natural-language documents
to locate those that best match a query.
The tsvector
type represents a document in a form optimized
for text search; the tsquery
type similarly represents
a text query.
Chapter 12 provides a detailed explanation of this
facility, and Section 9.13 summarizes the
related functions and operators.
tsvector
A tsvector
value is a sorted list of distinct
lexemes, which are words that have been
normalized to merge different variants of the same word
(see Chapter 12 for details). Sorting and
duplicate-elimination are done automatically during input, as shown in
this example:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
To represent lexemes containing whitespace or punctuation, surround them with quotes:
SELECT $$the lexeme ' ' contains spaces$$::tsvector;
                 tsvector
-------------------------------------------
 ' ' 'contains' 'lexeme' 'spaces' 'the'
(We use dollar-quoted string literals in this example and the next one to avoid the confusion of having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled:
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector
------------------------------------------------
 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
Optionally, integer positions can be attached to lexemes:
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                   tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
A position normally indicates the source word's location in the document. Positional information can be used for proximity ranking. Position values can range from 1 to 16383; larger numbers are silently set to 16383. Duplicate positions for the same lexeme are discarded.
Lexemes that have positions can further be labeled with a
weight, which can be A
,
B
, C
, or D
.
D
is the default and hence is not shown on output:
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector
----------------------------
 'a':1A 'cat':5 'fat':2B,4C
Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers.
It is important to understand that the
tsvector
type itself does not perform any word
normalization; it assumes the words it is given are normalized
appropriately for the application. For example,
SELECT 'The Fat Rats'::tsvector;
      tsvector
--------------------
 'Fat' 'Rats' 'The'
For most English-text-searching applications the above words would
be considered non-normalized, but tsvector
doesn't care.
Raw document text should usually be passed through
to_tsvector
to normalize the words appropriately
for searching:
SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3
Again, see Chapter 12 for more detail.
tsquery
A tsquery
value stores lexemes that are to be
searched for, and can combine them using the Boolean operators
&
(AND), |
(OR), and
!
(NOT), as well as the phrase search operator
<->
(FOLLOWED BY). There is also a variant
<N> of the FOLLOWED BY operator, where N
is an integer constant that
specifies the distance between the two lexemes being searched
for. <->
is equivalent to <1>
.
Parentheses can be used to enforce grouping of these operators.
In the absence of parentheses, !
(NOT) binds most tightly,
<->
(FOLLOWED BY) next most tightly, then
&
(AND), with |
(OR) binding
the least tightly.
Here are some examples:
SELECT 'fat & rat'::tsquery;
    tsquery
---------------
 'fat' & 'rat'

SELECT 'fat & (rat | cat)'::tsquery;
          tsquery
---------------------------
 'fat' & ( 'rat' | 'cat' )

SELECT 'fat & rat & ! cat'::tsquery;
        tsquery
------------------------
 'fat' & 'rat' & !'cat'
Optionally, lexemes in a tsquery
can be labeled with
one or more weight letters, which restricts them to match only
tsvector
lexemes with one of those weights:
SELECT 'fat:ab & cat'::tsquery;
     tsquery
------------------
 'fat':AB & 'cat'
Also, lexemes in a tsquery
can be labeled with *
to specify prefix matching:
SELECT 'super:*'::tsquery;
  tsquery
-----------
 'super':*
This query will match any word in a tsvector
that begins
with “super”.
Quoting rules for lexemes are the same as described previously for
lexemes in tsvector
; and, as with tsvector
,
any required normalization of words must be done before converting
to the tsquery
type. The to_tsquery
function is convenient for performing such normalization:
SELECT to_tsquery('Fat:ab & Cats');
    to_tsquery
------------------
 'fat':AB & 'cat'
Note that to_tsquery
will process prefixes in the same way
as other words, which means this comparison returns true:
SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' );
 ?column?
----------
 t
because postgres
gets stemmed to postgr
:
SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' );
  to_tsvector  | to_tsquery
---------------+------------
 'postgradu':1 | 'postgr':*
which will match the stemmed form of postgraduate
.
The data type uuid
stores Universally Unique Identifiers
(UUID) as defined by RFC 4122,
ISO/IEC 9834-8:2005, and related standards.
(Some systems refer to this data type as a globally unique identifier, or
GUID, instead.) This
identifier is a 128-bit quantity that is generated by an algorithm chosen
to make it very unlikely that the same identifier will be generated by
anyone else in the known universe using the same algorithm. Therefore,
for distributed systems, these identifiers provide a better uniqueness
guarantee than sequence generators, which
are only unique within a single database.
A UUID is written as a sequence of lower-case hexadecimal digits, in several groups separated by hyphens, specifically a group of 8 digits followed by three groups of 4 digits followed by a group of 12 digits, for a total of 32 digits representing the 128 bits. An example of a UUID in this standard form is:
a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
PostgreSQL also accepts the following alternative forms for input: use of upper-case digits, the standard format surrounded by braces, omitting some or all hyphens, adding a hyphen after any group of four digits. Examples are:
A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11
{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}
a0eebc999c0b4ef8bb6d6bb9bd380a11
a0ee-bc99-9c0b-4ef8-bb6d-6bb9-bd38-0a11
{a0eebc99-9c0b4ef8-bb6d6bb9-bd380a11}
Output is always in the standard form.
See Section 9.14 for how to generate a UUID in PostgreSQL.
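For instance, on recent PostgreSQL releases a random (version 4) UUID can be generated with the built-in gen_random_uuid() function; older releases may instead require an extension such as uuid-ossp or pgcrypto:

SELECT gen_random_uuid();   -- returns a different value on each call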
The xml
data type can be used to store XML data. Its
advantage over storing XML data in a text
field is that it
checks the input values for well-formedness, and there are support
functions to perform type-safe operations on it; see Section 9.15. Use of this data type requires the
installation to have been built with configure
--with-libxml
.
The xml
type can store well-formed
“documents”, as defined by the XML standard, as well
as “content” fragments, which are defined by reference
to the more permissive
“document node”
of the XQuery and XPath data model.
Roughly, this means that content fragments can have more than one top-level element or character node. The expression xmlvalue IS DOCUMENT can be used to evaluate whether a particular xml value is a full document or only a content fragment.
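For example, assuming the default xmloption of CONTENT (an illustrative query, with output shown as it would typically appear):

SELECT xml '<foo>bar</foo>' IS DOCUMENT AS doc,
       xml '<foo>bar</foo><bar>foo</bar>' IS DOCUMENT AS fragment;
 doc | fragment
-----+----------
 t   | f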
Limits and compatibility notes for the xml
data type
can be found in Section D.3.
To produce a value of type xml
from character data,
use the function
xmlparse
:
XMLPARSE ( { DOCUMENT | CONTENT } value )
Examples:
XMLPARSE (DOCUMENT '<?xml version="1.0"?><book><title>Manual</title><chapter>...</chapter></book>')
XMLPARSE (CONTENT 'abc<foo>bar</foo><bar>foo</bar>')
While this is the only way to convert character strings into XML values according to the SQL standard, the PostgreSQL-specific syntaxes:
xml '<foo>bar</foo>' '<foo>bar</foo>'::xml
can also be used.
The xml
type does not validate input values
against a document type declaration
(DTD),
even when the input value specifies a DTD.
There is also currently no built-in support for validating against
other XML schema languages such as XML Schema.
The inverse operation, producing a character string value from
xml
, uses the function
xmlserialize
:
XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type )

type can be character, character varying, or text
(or an alias for one of those). Again, according
to the SQL standard, this is the only way to convert between type
xml
and character types, but PostgreSQL also allows
you to simply cast the value.
When a character string value is cast to or from type
xml
without going through XMLPARSE
or
XMLSERIALIZE
, respectively, the choice of
DOCUMENT
versus CONTENT
is
determined by the “XML option”
session configuration parameter, which can be set using the
standard command:
SET XML OPTION { DOCUMENT | CONTENT };
or the more PostgreSQL-like syntax
SET xmloption TO { DOCUMENT | CONTENT };
The default is CONTENT
, so all forms of XML
data are allowed.
Care must be taken when dealing with multiple character encodings
on the client, server, and in the XML data passed through them.
When using the text mode to pass queries to the server and query
results to the client (which is the normal mode), PostgreSQL
converts all character data passed between the client and the
server and vice versa to the character encoding of the respective
end; see Section 24.3. This includes string
representations of XML values, such as in the above examples.
This would ordinarily mean that encoding declarations contained in
XML data can become invalid as the character data is converted
to other encodings while traveling between client and server,
because the embedded encoding declaration is not changed. To cope
with this behavior, encoding declarations contained in
character strings presented for input to the xml
type
are ignored, and content is assumed
to be in the current server encoding. Consequently, for correct
processing, character strings of XML data must be sent
from the client in the current client encoding. It is the
responsibility of the client to either convert documents to the
current client encoding before sending them to the server, or to
adjust the client encoding appropriately. On output, values of
type xml
will not have an encoding declaration, and
clients should assume all data is in the current client
encoding.
When using binary mode to pass query parameters to the server and query results back to the client, no encoding conversion is performed, so the situation is different. In this case, an encoding declaration in the XML data will be observed, and if it is absent, the data will be assumed to be in UTF-8 (as required by the XML standard; note that PostgreSQL does not support UTF-16). On output, data will have an encoding declaration specifying the client encoding, unless the client encoding is UTF-8, in which case it will be omitted.
Needless to say, processing XML data with PostgreSQL will be less error-prone and more efficient if the XML data encoding, client encoding, and server encoding are the same. Since XML data is internally processed in UTF-8, computations will be most efficient if the server encoding is also UTF-8.
Some XML-related functions may not work at all on non-ASCII data
when the server encoding is not UTF-8. This is known to be an
issue for xmltable()
and xpath()
in particular.
The xml
data type is unusual in that it does not
provide any comparison operators. This is because there is no
well-defined and universally useful comparison algorithm for XML
data. One consequence of this is that you cannot retrieve rows by
comparing an xml
column against a search value. XML
values should therefore typically be accompanied by a separate key
field such as an ID. An alternative solution for comparing XML
values is to convert them to character strings first, but note
that character string comparison has little to do with a useful
XML comparison method.
Since there are no comparison operators for the xml
data type, it is not possible to create an index directly on a
column of this type. If speedy searches in XML data are desired,
possible workarounds include casting the expression to a
character string type and indexing that, or indexing an XPath
expression. Of course, the actual query would have to be adjusted
to search by the indexed expression.
The text-search functionality in PostgreSQL can also be used to speed up full-document searches of XML data. The necessary preprocessing support is, however, not yet available in the PostgreSQL distribution.
JSON data types are for storing JSON (JavaScript Object Notation)
data, as specified in RFC
7159. Such data can also be stored as text
, but
the JSON data types have the advantage of enforcing that each
stored value is valid according to the JSON rules. There are also
assorted JSON-specific functions and operators available for data stored
in these data types; see Section 9.16.
PostgreSQL offers two types for storing JSON
data: json
and jsonb
. To implement efficient query
mechanisms for these data types, PostgreSQL
also provides the jsonpath
data type described in
Section 8.14.7.
The json
and jsonb
data types
accept almost identical sets of values as
input. The major practical difference is one of efficiency. The
json
data type stores an exact copy of the input text,
which processing functions must reparse on each execution; while
jsonb
data is stored in a decomposed binary format that
makes it slightly slower to input due to added conversion
overhead, but significantly faster to process, since no reparsing
is needed. jsonb
also supports indexing, which can be a
significant advantage.
Because the json
type stores an exact copy of the input text, it
will preserve semantically-insignificant white space between tokens, as
well as the order of keys within JSON objects. Also, if a JSON object
within the value contains the same key more than once, all the key/value
pairs are kept. (The processing functions consider the last value as the
operative one.) By contrast, jsonb
does not preserve white
space, does not preserve the order of object keys, and does not keep
duplicate object keys. If duplicate keys are specified in the input,
only the last value is kept.
In general, most applications should prefer to store JSON data as
jsonb
, unless there are quite specialized needs, such as
legacy assumptions about ordering of object keys.
RFC 7159 specifies that JSON strings should be encoded in UTF8. It is therefore not possible for the JSON types to conform rigidly to the JSON specification unless the database encoding is UTF8. Attempts to directly include characters that cannot be represented in the database encoding will fail; conversely, characters that can be represented in the database encoding but not in UTF8 will be allowed.
RFC 7159 permits JSON strings to contain Unicode escape sequences denoted by \uXXXX. In the input function for the json type, Unicode escapes are allowed regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow \u).
However, the input function for jsonb
is stricter: it disallows
Unicode escapes for characters that cannot be represented in the database
encoding. The jsonb
type also
rejects \u0000
(because that cannot be represented in
PostgreSQL's text
type), and it insists
that any use of Unicode surrogate pairs to designate characters outside
the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes
are converted to the equivalent single character for storage;
this includes folding surrogate pairs into a single character.
Many of the JSON processing functions described
in Section 9.16 will convert Unicode escapes to
regular characters, and will therefore throw the same types of errors
just described even if their input is of type json
not jsonb
. The fact that the json
input function does
not make these checks may be considered a historical artifact, although
it does allow for simple storage (without processing) of JSON Unicode
escapes in a database encoding that does not support the represented
characters.
When converting textual JSON input into jsonb
, the primitive
types described by RFC 7159 are effectively mapped onto
native PostgreSQL types, as shown
in Table 8.23.
Therefore, there are some minor additional constraints on what
constitutes valid jsonb
data that do not apply to
the json
type, nor to JSON in the abstract, corresponding
to limits on what can be represented by the underlying data type.
Notably, jsonb
will reject numbers that are outside the
range of the PostgreSQL numeric
data
type, while json
will not. Such implementation-defined
restrictions are permitted by RFC 7159. However, in
practice such problems are far more likely to occur in other
implementations, as it is common to represent JSON's number
primitive type as IEEE 754 double precision floating point
(which RFC 7159 explicitly anticipates and allows for).
When using JSON as an interchange format with such systems, the danger
of losing numeric precision compared to data originally stored
by PostgreSQL should be considered.
Conversely, as noted in the table there are some minor restrictions on the input format of JSON primitive types that do not apply to the corresponding PostgreSQL types.
Table 8.23. JSON Primitive Types and Corresponding PostgreSQL Types
JSON primitive type | PostgreSQL type | Notes |
---|---|---|
string | text | \u0000 is disallowed, as are Unicode escapes representing characters not available in the database encoding |
number | numeric | NaN and infinity values are disallowed |
boolean | boolean | Only lowercase true and false spellings are accepted |
null | (none) | SQL NULL is a different concept |
The input/output syntax for the JSON data types is as specified in RFC 7159.
The following are all valid json
(or jsonb
) expressions:
-- Simple scalar/primitive value
-- Primitive values can be numbers, quoted strings, true, false, or null
SELECT '5'::json;

-- Array of zero or more elements (elements need not be of same type)
SELECT '[1, 2, "foo", null]'::json;

-- Object containing pairs of keys and values
-- Note that object keys must always be quoted strings
SELECT '{"bar": "baz", "balance": 7.77, "active": false}'::json;

-- Arrays and objects can be nested arbitrarily
SELECT '{"foo": [true, "bar"], "tags": {"a": 1, "b": null}}'::json;
As previously stated, when a JSON value is input and then printed without
any additional processing, json
outputs the same text that was
input, while jsonb
does not preserve semantically-insignificant
details such as whitespace. For example, note the differences here:
SELECT '{"bar": "baz", "balance": 7.77, "active":false}'::json; json ------------------------------------------------- {"bar": "baz", "balance": 7.77, "active":false} (1 row) SELECT '{"bar": "baz", "balance": 7.77, "active":false}'::jsonb; jsonb -------------------------------------------------- {"bar": "baz", "active": false, "balance": 7.77} (1 row)
One semantically-insignificant detail worth noting is that
in jsonb
, numbers will be printed according to the behavior of the
underlying numeric
type. In practice this means that numbers
entered with E
notation will be printed without it, for
example:
SELECT '{"reading": 1.230e-5}'::json, '{"reading": 1.230e-5}'::jsonb; json | jsonb -----------------------+------------------------- {"reading": 1.230e-5} | {"reading": 0.00001230} (1 row)
However, jsonb
will preserve trailing fractional zeroes, as seen
in this example, even though those are semantically insignificant for
purposes such as equality checks.
For the list of built-in functions and operators available for constructing and processing JSON values, see Section 9.16.
Representing data as JSON can be considerably more flexible than the traditional relational data model, which is compelling in environments where requirements are fluid. It is quite possible for both approaches to co-exist and complement each other within the same application. However, even for applications where maximal flexibility is desired, it is still recommended that JSON documents have a somewhat fixed structure. The structure is typically unenforced (though enforcing some business rules declaratively is possible), but having a predictable structure makes it easier to write queries that usefully summarize a set of “documents” (datums) in a table.
JSON data is subject to the same concurrency-control considerations as any other data type when stored in a table. Although storing large documents is practicable, keep in mind that any update acquires a row-level lock on the whole row. Consider limiting JSON documents to a manageable size in order to decrease lock contention among updating transactions. Ideally, JSON documents should each represent an atomic datum that business rules dictate cannot reasonably be further subdivided into smaller datums that could be modified independently.
jsonb Containment and Existence
Testing containment is an important capability of
jsonb
. There is no parallel set of facilities for the
json
type. Containment tests whether
one jsonb
document has contained within it another one.
These examples return true except as noted:
-- Simple scalar/primitive values contain only the identical value:
SELECT '"foo"'::jsonb @> '"foo"'::jsonb;
-- The array on the right side is contained within the one on the left:
SELECT '[1, 2, 3]'::jsonb @> '[1, 3]'::jsonb;
-- Order of array elements is not significant, so this is also true:
SELECT '[1, 2, 3]'::jsonb @> '[3, 1]'::jsonb;
-- Duplicate array elements don't matter either:
SELECT '[1, 2, 3]'::jsonb @> '[1, 2, 2]'::jsonb;
-- The object with a single pair on the right side is contained
-- within the object on the left side:
SELECT '{"product": "PostgreSQL", "version": 9.4, "jsonb": true}'::jsonb @> '{"version": 9.4}'::jsonb;
-- The array on the right side is not considered contained within the
-- array on the left, even though a similar array is nested within it:
SELECT '[1, 2, [1, 3]]'::jsonb @> '[1, 3]'::jsonb; -- yields false
-- But with a layer of nesting, it is contained:
SELECT '[1, 2, [1, 3]]'::jsonb @> '[[1, 3]]'::jsonb;
-- Similarly, containment is not reported here:
SELECT '{"foo": {"bar": "baz"}}'::jsonb @> '{"bar": "baz"}'::jsonb; -- yields false
-- A top-level key and an empty object is contained:
SELECT '{"foo": {"bar": "baz"}}'::jsonb @> '{"foo": {}}'::jsonb;
The general principle is that the contained object must match the containing object as to structure and data contents, possibly after discarding some non-matching array elements or object key/value pairs from the containing object. But remember that the order of array elements is not significant when doing a containment match, and duplicate array elements are effectively considered only once.
As a special exception to the general principle that the structures must match, an array may contain a primitive value:
-- This array contains the primitive string value:
SELECT '["foo", "bar"]'::jsonb @> '"bar"'::jsonb;

-- This exception is not reciprocal -- non-containment is reported here:
SELECT '"bar"'::jsonb @> '["bar"]'::jsonb;  -- yields false
jsonb
also has an existence operator, which is
a variation on the theme of containment: it tests whether a string
(given as a text
value) appears as an object key or array
element at the top level of the jsonb
value.
These examples return true except as noted:
-- String exists as array element:
SELECT '["foo", "bar", "baz"]'::jsonb ? 'bar';

-- String exists as object key:
SELECT '{"foo": "bar"}'::jsonb ? 'foo';

-- Object values are not considered:
SELECT '{"foo": "bar"}'::jsonb ? 'bar';  -- yields false

-- As with containment, existence must match at the top level:
SELECT '{"foo": {"bar": "baz"}}'::jsonb ? 'bar';  -- yields false

-- A string is considered to exist if it matches a primitive JSON string:
SELECT '"foo"'::jsonb ? 'foo';
JSON objects are better suited than arrays for testing containment or existence when there are many keys or elements involved, because unlike arrays they are internally optimized for searching, and do not need to be searched linearly.
Because JSON containment is nested, an appropriate query can skip
explicit selection of sub-objects. As an example, suppose that we have
a doc
column containing objects at the top level, with
most objects containing tags
fields that contain arrays of
sub-objects. This query finds entries in which sub-objects containing
both "term":"paris"
and "term":"food"
appear,
while ignoring any such keys outside the tags
array:
SELECT doc->'site_name' FROM websites WHERE doc @> '{"tags":[{"term":"paris"}, {"term":"food"}]}';
One could accomplish the same thing with, say,
SELECT doc->'site_name' FROM websites WHERE doc->'tags' @> '[{"term":"paris"}, {"term":"food"}]';
but that approach is less flexible, and often less efficient as well.
On the other hand, the JSON existence operator is not nested: it will only look for the specified key or array element at top level of the JSON value.
The various containment and existence operators, along with all other JSON operators and functions are documented in Section 9.16.
jsonb Indexing
GIN indexes can be used to efficiently search for
keys or key/value pairs occurring within a large number of
jsonb
documents (datums).
Two GIN “operator classes” are provided, offering different
performance and flexibility trade-offs.
The default GIN operator class for jsonb
supports queries with
the key-exists operators ?
, ?|
and ?&
, the containment operator
@>
, and the jsonpath
match
operators @?
and @@
.
(For details of the semantics that these operators
implement, see Table 9.45.)
An example of creating an index with this operator class is:
CREATE INDEX idxgin ON api USING GIN (jdoc);
The non-default GIN operator class jsonb_path_ops
does not support the key-exists operators, but it does support
@>
, @?
and @@
.
An example of creating an index with this operator class is:
CREATE INDEX idxginp ON api USING GIN (jdoc jsonb_path_ops);
Consider the example of a table that stores JSON documents retrieved from a third-party web service, with a documented schema definition. A typical document is:
{ "guid": "9c36adc1-7fb5-4d5b-83b4-90356a46061a", "name": "Angela Barton", "is_active": true, "company": "Magnafone", "address": "178 Howard Place, Gulf, Washington, 702", "registered": "2009-11-07T08:53:22 +08:00", "latitude": 19.793713, "longitude": 86.513373, "tags": [ "enim", "aliquip", "qui" ] }
We store these documents in a table named api
,
in a jsonb
column named jdoc
.
If a GIN index is created on this column,
queries like the following can make use of the index:
-- Find documents in which the key "company" has value "Magnafone" SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"company": "Magnafone"}';
However, the index could not be used for queries like the
following, because though the operator ?
is indexable,
it is not applied directly to the indexed column jdoc
:
-- Find documents in which the key "tags" contains key or array element "qui" SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc -> 'tags' ? 'qui';
Still, with appropriate use of expression indexes, the above
query can use an index. If querying for particular items within
the "tags"
key is common, defining an index like this
may be worthwhile:
CREATE INDEX idxgintags ON api USING GIN ((jdoc -> 'tags'));
Now, the WHERE
clause jdoc -> 'tags' ? 'qui'
will be recognized as an application of the indexable
operator ?
to the indexed
expression jdoc -> 'tags'
.
(More information on expression indexes can be found in Section 11.7.)
Another approach to querying is to exploit containment, for example:
-- Find documents in which the key "tags" contains array element "qui" SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qui"]}';
A simple GIN index on the jdoc
column can support this
query. But note that such an index will store copies of every key and
value in the jdoc
column, whereas the expression index
of the previous example stores only data found under
the tags
key. While the simple-index approach is far more
flexible (since it supports queries about any key), targeted expression
indexes are likely to be smaller and faster to search than a simple
index.
GIN indexes also support the @?
and @@
operators, which
perform jsonpath
matching. Examples are
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @? '$.tags[*] ? (@ == "qui")';
SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @@ '$.tags[*] == "qui"';
For these operators, a GIN index extracts clauses of the form accessors_chain = constant out of the jsonpath pattern, and does the index search based on the keys and values mentioned in these clauses. The accessors chain may include .key, [*], and [index] accessors. The jsonb_ops operator class also supports .* and .** accessors, but the jsonb_path_ops operator class does not.
Although the jsonb_path_ops
operator class supports
only queries with the @>
, @?
and @@
operators, it has notable
performance advantages over the default operator
class jsonb_ops
. A jsonb_path_ops
index is usually much smaller than a jsonb_ops
index over the same data, and the specificity of searches is better,
particularly when queries contain keys that appear frequently in the
data. Therefore search operations typically perform better
than with the default operator class.
The technical difference between a jsonb_ops
and a jsonb_path_ops
GIN index is that the former
creates independent index items for each key and value in the data,
while the latter creates index items only for each value in the
data.
Basically, each jsonb_path_ops
index item is
a hash of the value and the key(s) leading to it; for example to index
{"foo": {"bar": "baz"}}
, a single index item would
be created incorporating all three of foo
, bar
,
and baz
into the hash value. Thus a containment query
looking for this structure would result in an extremely specific index
search; but there is no way at all to find out whether foo
appears as a key. On the other hand, a jsonb_ops
index would create three index items representing foo
,
bar
, and baz
separately; then to do the
containment query, it would look for rows containing all three of
these items. While GIN indexes can perform such an AND search fairly
efficiently, it will still be less specific and slower than the
equivalent jsonb_path_ops
search, especially if
there are a very large number of rows containing any single one of the
three index items.
A disadvantage of the jsonb_path_ops
approach is
that it produces no index entries for JSON structures not containing
any values, such as {"a": {}}
. If a search for
documents containing such a structure is requested, it will require a
full-index scan, which is quite slow. jsonb_path_ops
is
therefore ill-suited for applications that often perform such searches.
jsonb
also supports btree
and hash
indexes. These are usually useful only if it's important to check
equality of complete JSON documents.
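For example, an equality-only index on the jdoc column used above could be created along these lines (a sketch; the index names are arbitrary):

CREATE INDEX idxjdoc_btree ON api USING BTREE (jdoc);
CREATE INDEX idxjdoc_hash  ON api USING HASH (jdoc);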
The btree
ordering for jsonb
datums is seldom
of great interest, but for completeness it is:
Object > Array > Boolean > Number > String > Null

Object with n pairs > object with n - 1 pairs

Array with n elements > array with n - 1 elements

Objects with equal numbers of pairs are compared in the order:

key-1, value-1, key-2 ...
Note that object keys are compared in their storage order; in particular, since shorter keys are stored before longer keys, this can lead to results that might be unintuitive, such as:
{ "aa": 1, "c": 1} > {"b": 1, "d": 1}
Similarly, arrays with equal numbers of elements are compared in the order:
element-1, element-2 ...
Primitive JSON values are compared using the same comparison rules as for the underlying PostgreSQL data type. Strings are compared using the default database collation.
jsonb Subscripting
The jsonb
data type supports array-style subscripting expressions
to extract and modify elements. Nested values can be indicated by chaining
subscripting expressions, following the same rules as the path
argument in the jsonb_set
function. If a jsonb
value is an array, numeric subscripts start at zero, and negative integers count
backwards from the last element of the array. Slice expressions are not supported.
The result of a subscripting expression is always of the jsonb data type.
UPDATE
statements may use subscripting in the
SET
clause to modify jsonb
values. Subscript
paths must be traversable for all affected values insofar as they exist. For
instance, the path val['a']['b']['c']
can be traversed all
the way to c
if every val
,
val['a']
, and val['a']['b']
is an
object. If any val['a']
or val['a']['b']
is not defined, it will be created as an empty object and filled as
necessary. However, if any val
itself or one of the
intermediary values is defined as a non-object such as a string, number, or
jsonb
null
, traversal cannot proceed so
an error is raised and the transaction aborted.
An example of subscripting syntax:
-- Extract object value by key
SELECT ('{"a": 1}'::jsonb)['a'];

-- Extract nested object value by key path
SELECT ('{"a": {"b": {"c": 1}}}'::jsonb)['a']['b']['c'];

-- Extract array element by index
SELECT ('[1, "2", null]'::jsonb)[1];

-- Update object value by key. Note the quotes around '1': the assigned
-- value must be of the jsonb type as well
UPDATE table_name SET jsonb_field['key'] = '1';

-- This will raise an error if any record's jsonb_field['a']['b'] is something
-- other than an object. For example, the value {"a": 1} has a numeric value
-- of the key 'a'.
UPDATE table_name SET jsonb_field['a']['b']['c'] = '1';

-- Filter records using a WHERE clause with subscripting. Since the result of
-- subscripting is jsonb, the value we compare it against must also be jsonb.
-- The double quotes make "value" also a valid jsonb string.
SELECT * FROM table_name WHERE jsonb_field['key'] = '"value"';
jsonb
assignment via subscripting handles a few edge cases
differently from jsonb_set
. When a source jsonb
value is NULL
, assignment via subscripting will proceed
as if it was an empty JSON value of the type (object or array) implied by the
subscript key:
-- Where jsonb_field was NULL, it is now {"a": 1}
UPDATE table_name SET jsonb_field['a'] = '1';

-- Where jsonb_field was NULL, it is now [1]
UPDATE table_name SET jsonb_field[0] = '1';
If an index is specified for an array containing too few elements,
NULL
elements will be appended until the index is reachable
and the value can be set.
-- Where jsonb_field was [], it is now [null, null, 2];
-- where jsonb_field was [0], it is now [0, null, 2]
UPDATE table_name SET jsonb_field[2] = '2';
A jsonb
value will accept assignments to nonexistent subscript
paths as long as the last existing element to be traversed is an object or
array, as implied by the corresponding subscript (the element indicated by
the last subscript in the path is not traversed and may be anything). Nested
array and object structures will be created, and in the former case
null
-padded, as specified by the subscript path until the
assigned value can be placed.
-- Where jsonb_field was {}, it is now {"a": [{"b": 1}]}
UPDATE table_name SET jsonb_field['a'][0]['b'] = '1';

-- Where jsonb_field was [], it is now [null, {"a": 1}]
UPDATE table_name SET jsonb_field[1]['a'] = '1';
Additional extensions are available that implement transforms for the
jsonb
type for different procedural languages.
The extensions for PL/Perl are called jsonb_plperl
and
jsonb_plperlu
. If you use them, jsonb
values are mapped to Perl arrays, hashes, and scalars, as appropriate.
The extensions for PL/Python are called jsonb_plpythonu
,
jsonb_plpython2u
, and
jsonb_plpython3u
(see Section 46.1 for the PL/Python naming convention). If you
use them, jsonb
values are mapped to Python dictionaries,
lists, and scalars, as appropriate.
Of these extensions, jsonb_plperl
is
considered “trusted”, that is, it can be installed by
non-superusers who have CREATE
privilege on the
current database. The rest require superuser privilege to install.
The jsonpath
type implements support for the SQL/JSON path language
in PostgreSQL to efficiently query JSON data.
It provides a binary representation of the parsed SQL/JSON path
expression that specifies the items to be retrieved by the path
engine from the JSON data for further processing with the
SQL/JSON query functions.
The semantics of SQL/JSON path predicates and operators generally follow SQL. At the same time, to provide a natural way of working with JSON data, SQL/JSON path syntax uses some JavaScript conventions:
Dot (.
) is used for member access.
Square brackets ([]
) are used for array access.
SQL/JSON arrays are 0-relative, unlike regular SQL arrays that start from 1.
An SQL/JSON path expression is typically written in an SQL query as an
SQL character string literal, so it must be enclosed in single quotes,
and any single quotes desired within the value must be doubled
(see Section 4.1.2.1).
Some forms of path expressions require string literals within them.
These embedded string literals follow JavaScript/ECMAScript conventions:
they must be surrounded by double quotes, and backslash escapes may be
used within them to represent otherwise-hard-to-type characters.
In particular, the way to write a double quote within an embedded string
literal is \"
, and to write a backslash itself, you
must write \\
. Other special backslash sequences
include those recognized in JavaScript strings:
\b, \f, \n, \r, \t, \v for various ASCII control characters, \xNN for a character code written with only two hex digits, \uNNNN for a Unicode character identified by its 4-hex-digit code point, and \u{N...} for a Unicode character code point written with 1 to 6 hex digits.
A path expression consists of a sequence of path elements, which can be any of the following:
Path literals of JSON primitive types: Unicode text, numeric, true, false, or null.
Path variables listed in Table 8.24.
Accessor operators listed in Table 8.25.
jsonpath
operators and methods listed
in Section 9.16.2.2.
Parentheses, which can be used to provide filter expressions or define the order of path evaluation.
For details on using jsonpath
expressions with SQL/JSON
query functions, see Section 9.16.2.
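As a small taste of the path language, the jsonb_path_query function (documented in Section 9.16) evaluates a path expression against a jsonb value; a brief illustration, with output shown as it would typically appear:

SELECT jsonb_path_query('{"a": [1, 2, 3, 4]}', '$.a[*] ? (@ > 2)');
 jsonb_path_query
------------------
 3
 4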
Table 8.24. jsonpath Variables
Variable | Description |
---|---|
$ | A variable representing the JSON value being queried (the context item). |
$varname | A named variable. Its value can be set by the parameter vars of several JSON processing functions; see Table 9.47 for details. |
@ | A variable representing the result of path evaluation in filter expressions. |
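For example (a minimal sketch; the JSON value and the variable name min are arbitrary), a named variable can be supplied through the vars parameter of jsonb_path_query and referenced in a filter expression:
SELECT jsonb_path_query('{"x": 5}', '$.x ? (@ > $min)', '{"min": 3}');
-- returns 5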
Table 8.25. jsonpath
Accessors
Accessor Operator | Description |
---|---|
.key, ."$varname" | Member accessor that returns an object member with the specified key. If the key name matches some named variable starting with $ or does not meet the JavaScript rules for an identifier, it must be enclosed in double quotes to make it a string literal. |
.* | Wildcard member accessor that returns the values of all members located at the top level of the current object. |
.** | Recursive wildcard member accessor that processes all levels of the JSON hierarchy of the current object and returns all the member values, regardless of their nesting level. This is a PostgreSQL extension of the SQL/JSON standard. |
.**{level}, .**{start_level to end_level} | Like .**, but selects only the specified levels of the JSON hierarchy. Nesting levels are specified as integers. Level zero corresponds to the current object. This is a PostgreSQL extension of the SQL/JSON standard. |
[subscript, ...] | Array element accessor.
The specified subscript can be given in two forms: index or start_index to end_index. The first form returns a single array element by its index; the second form returns an array slice by the range of indexes, including the elements that correspond to both start_index and end_index. |
[*] | Wildcard array element accessor that returns all array elements. |
PostgreSQL allows columns of a table to be defined as variable-length multidimensional arrays. Arrays of any built-in or user-defined base type, enum type, composite type, range type, or domain can be created.
To illustrate the use of array types, we create this table:
CREATE TABLE sal_emp ( name text, pay_by_quarter integer[], schedule text[][] );
As shown, an array data type is named by appending square brackets
([]
) to the data type name of the array elements. The
above command will create a table named
sal_emp
with a column of type
text
(name
), a
one-dimensional array of type integer
(pay_by_quarter
), which represents the
employee's salary by quarter, and a two-dimensional array of
text
(schedule
), which
represents the employee's weekly schedule.
The syntax for CREATE TABLE
allows the exact size of
arrays to be specified, for example:
CREATE TABLE tictactoe ( squares integer[3][3] );
However, the current implementation ignores any supplied array size limits, i.e., the behavior is the same as for arrays of unspecified length.
The current implementation does not enforce the declared
number of dimensions either. Arrays of a particular element type are
all considered to be of the same type, regardless of size or number
of dimensions. So, declaring the array size or number of dimensions in
CREATE TABLE
is simply documentation; it does not
affect run-time behavior.
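For example (a hedged illustration using the tictactoe table above), values that do not match the declared sizes or dimensionality are still accepted:
INSERT INTO tictactoe VALUES ('{{1,2},{3,4}}');  -- accepted despite the declared 3-by-3 size
INSERT INTO tictactoe VALUES ('{1,2,3}');        -- accepted despite the declared two dimensions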
An alternative syntax, which conforms to the SQL standard by using
the keyword ARRAY
, can be used for one-dimensional arrays.
pay_by_quarter
could have been defined
as:
pay_by_quarter integer ARRAY[4],
Or, if no array size is to be specified:
pay_by_quarter integer ARRAY,
As before, however, PostgreSQL does not enforce the size restriction in any case.
To write an array value as a literal constant, enclose the element values within curly braces and separate them by commas. (If you know C, this is not unlike the C syntax for initializing structures.) You can put double quotes around any element value, and must do so if it contains commas or curly braces. (More details appear below.) Thus, the general format of an array constant is the following:
'{ val1 delim val2 delim ... }'
where delim
is the delimiter character
for the type, as recorded in its pg_type
entry.
Among the standard data types provided in the
PostgreSQL distribution, all use a comma
(,
), except for type box
which uses a semicolon
(;
). Each val
is
either a constant of the array element type, or a subarray. An example
of an array constant is:
'{{1,2,3},{4,5,6},{7,8,9}}'
This constant is a two-dimensional, 3-by-3 array consisting of three subarrays of integers.
To set an element of an array constant to NULL, write NULL
for the element value. (Any upper- or lower-case variant of
NULL
will do.) If you want an actual string value
“NULL”, you must put double quotes around it.
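For instance (an illustrative sketch with arbitrary values):
SELECT '{1,NULL,3}'::int[];     -- the second element is a null
SELECT '{"NULL",xyz}'::text[];  -- the first element is the string "NULL", not a null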
(These kinds of array constants are actually only a special case of the generic type constants discussed in Section 4.1.2.7. The constant is initially treated as a string and passed to the array input conversion routine. An explicit type specification might be necessary.)
Now we can show some INSERT
statements:
INSERT INTO sal_emp VALUES ('Bill', '{10000, 10000, 10000, 10000}', '{{"meeting", "lunch"}, {"training", "presentation"}}'); INSERT INTO sal_emp VALUES ('Carol', '{20000, 25000, 25000, 25000}', '{{"breakfast", "consulting"}, {"meeting", "lunch"}}');
The result of the previous two inserts looks like this:
SELECT * FROM sal_emp; name | pay_by_quarter | schedule -------+---------------------------+------------------------------------------- Bill | {10000,10000,10000,10000} | {{meeting,lunch},{training,presentation}} Carol | {20000,25000,25000,25000} | {{breakfast,consulting},{meeting,lunch}} (2 rows)
Multidimensional arrays must have matching extents for each dimension. A mismatch causes an error, for example:
INSERT INTO sal_emp VALUES ('Bill', '{10000, 10000, 10000, 10000}', '{{"meeting", "lunch"}, {"meeting"}}'); ERROR: multidimensional arrays must have array expressions with matching dimensions
The ARRAY
constructor syntax can also be used:
INSERT INTO sal_emp VALUES ('Bill', ARRAY[10000, 10000, 10000, 10000], ARRAY[['meeting', 'lunch'], ['training', 'presentation']]); INSERT INTO sal_emp VALUES ('Carol', ARRAY[20000, 25000, 25000, 25000], ARRAY[['breakfast', 'consulting'], ['meeting', 'lunch']]);
Notice that the array elements are ordinary SQL constants or
expressions; for instance, string literals are single quoted, instead of
double quoted as they would be in an array literal. The ARRAY
constructor syntax is discussed in more detail in
Section 4.2.12.
Now, we can run some queries on the table. First, we show how to access a single element of an array. This query retrieves the names of the employees whose pay changed in the second quarter:
SELECT name FROM sal_emp WHERE pay_by_quarter[1] <> pay_by_quarter[2]; name ------- Carol (1 row)
The array subscript numbers are written within square brackets.
By default PostgreSQL uses a
one-based numbering convention for arrays, that is,
an array of n
elements starts with array[1] and ends with array[n].
This query retrieves the third quarter pay of all employees:
SELECT pay_by_quarter[3] FROM sal_emp; pay_by_quarter ---------------- 10000 25000 (2 rows)
We can also access arbitrary rectangular slices of an array, or
subarrays. An array slice is denoted by writing
lower-bound:upper-bound
for one or more array dimensions. For example, this query retrieves the first
item on Bill's schedule for the first two days of the week:
SELECT schedule[1:2][1:1] FROM sal_emp WHERE name = 'Bill'; schedule ------------------------ {{meeting},{training}} (1 row)
If any dimension is written as a slice, i.e., contains a colon, then all
dimensions are treated as slices. Any dimension that has only a single
number (no colon) is treated as being from 1
to the number specified. For example, [2]
is treated as
[1:2]
, as in this example:
SELECT schedule[1:2][2] FROM sal_emp WHERE name = 'Bill'; schedule ------------------------------------------- {{meeting,lunch},{training,presentation}} (1 row)
To avoid confusion with the non-slice case, it's best to use slice syntax
for all dimensions, e.g., [1:2][1:1]
, not [2][1:1]
.
It is possible to omit the lower-bound
and/or
upper-bound
of a slice specifier; the missing
bound is replaced by the lower or upper limit of the array's subscripts.
For example:
SELECT schedule[:2][2:] FROM sal_emp WHERE name = 'Bill'; schedule ------------------------ {{lunch},{presentation}} (1 row) SELECT schedule[:][1:1] FROM sal_emp WHERE name = 'Bill'; schedule ------------------------ {{meeting},{training}} (1 row)
An array subscript expression will return null if either the array itself or
any of the subscript expressions are null. Also, null is returned if a
subscript is outside the array bounds (this case does not raise an error).
For example, if schedule
currently has the dimensions [1:3][1:2]
then referencing
schedule[3][3]
yields NULL. Similarly, an array reference
with the wrong number of subscripts yields a null rather than an error.
An array slice expression likewise yields null if the array itself or any of the subscript expressions are null. However, in other cases such as selecting an array slice that is completely outside the current array bounds, a slice expression yields an empty (zero-dimensional) array instead of null. (This does not match non-slice behavior and is done for historical reasons.) If the requested slice partially overlaps the array bounds, then it is silently reduced to just the overlapping region instead of returning null.
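A few hedged examples of these behaviors (the array values are arbitrary):
SELECT ('{1,2,3}'::int[])[5];    -- subscript beyond the bounds yields NULL
SELECT ('{1,2,3}'::int[])[4:6];  -- slice entirely outside the bounds yields {}
SELECT ('{1,2,3}'::int[])[2:5];  -- partial overlap is reduced to {2,3}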
The current dimensions of any array value can be retrieved with the
array_dims
function:
SELECT array_dims(schedule) FROM sal_emp WHERE name = 'Carol'; array_dims ------------ [1:2][1:2] (1 row)
array_dims
produces a text
result,
which is convenient for people to read but perhaps inconvenient
for programs. Dimensions can also be retrieved with
array_upper
and array_lower
,
which return the upper and lower bound of a
specified array dimension, respectively:
SELECT array_upper(schedule, 1) FROM sal_emp WHERE name = 'Carol'; array_upper ------------- 2 (1 row)
array_length
will return the length of a specified
array dimension:
SELECT array_length(schedule, 1) FROM sal_emp WHERE name = 'Carol'; array_length -------------- 2 (1 row)
cardinality
returns the total number of elements in an
array across all dimensions. It is effectively the number of rows a call to
unnest
would yield:
SELECT cardinality(schedule) FROM sal_emp WHERE name = 'Carol'; cardinality ------------- 4 (1 row)
An array value can be replaced completely:
UPDATE sal_emp SET pay_by_quarter = '{25000,25000,27000,27000}' WHERE name = 'Carol';
or using the ARRAY
expression syntax:
UPDATE sal_emp SET pay_by_quarter = ARRAY[25000,25000,27000,27000] WHERE name = 'Carol';
An array can also be updated at a single element:
UPDATE sal_emp SET pay_by_quarter[4] = 15000 WHERE name = 'Bill';
or updated in a slice:
UPDATE sal_emp SET pay_by_quarter[1:2] = '{27000,27000}' WHERE name = 'Carol';
The slice syntaxes with omitted lower-bound
and/or
upper-bound
can be used too, but only when
updating an array value that is not NULL or zero-dimensional (otherwise,
there is no existing subscript limit to substitute).
A stored array value can be enlarged by assigning to elements not already
present. Any positions between those previously present and the newly
assigned elements will be filled with nulls. For example, if array
myarray
currently has 4 elements, it will have six
elements after an update that assigns to myarray[6]
;
myarray[5]
will contain null.
Currently, enlargement in this fashion is only allowed for one-dimensional
arrays, not multidimensional arrays.
Subscripted assignment allows creation of arrays that do not use one-based
subscripts. For example one might assign to myarray[-2:7]
to
create an array with subscript values from -2 to 7.
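As an illustrative sketch using the sal_emp table from earlier (the exact result depends on the updates already applied above):
UPDATE sal_emp SET pay_by_quarter[6] = 28000 WHERE name = 'Bill';
-- pay_by_quarter[5], never assigned, now contains NULL
UPDATE sal_emp SET pay_by_quarter[0] = 9000 WHERE name = 'Bill';
-- the array now has a lower bound of 0, e.g., array_dims would report [0:6]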
New array values can also be constructed using the concatenation operator,
||
:
SELECT ARRAY[1,2] || ARRAY[3,4]; ?column? ----------- {1,2,3,4} (1 row) SELECT ARRAY[5,6] || ARRAY[[1,2],[3,4]]; ?column? --------------------- {{5,6},{1,2},{3,4}} (1 row)
The concatenation operator allows a single element to be pushed onto the
beginning or end of a one-dimensional array. It also accepts two
N
-dimensional arrays, or an N
-dimensional
and an N+1
-dimensional array.
When a single element is pushed onto either the beginning or end of a one-dimensional array, the result is an array with the same lower bound subscript as the array operand. For example:
SELECT array_dims(1 || '[0:1]={2,3}'::int[]); array_dims ------------ [0:2] (1 row) SELECT array_dims(ARRAY[1,2] || 3); array_dims ------------ [1:3] (1 row)
When two arrays with an equal number of dimensions are concatenated, the result retains the lower bound subscript of the left-hand operand's outer dimension. The result is an array comprising every element of the left-hand operand followed by every element of the right-hand operand. For example:
SELECT array_dims(ARRAY[1,2] || ARRAY[3,4,5]); array_dims ------------ [1:5] (1 row) SELECT array_dims(ARRAY[[1,2],[3,4]] || ARRAY[[5,6],[7,8],[9,0]]); array_dims ------------ [1:5][1:2] (1 row)
When an N
-dimensional array is pushed onto the beginning
or end of an N+1
-dimensional array, the result is
analogous to the element-array case above. Each N
-dimensional
sub-array is essentially an element of the N+1
-dimensional
array's outer dimension. For example:
SELECT array_dims(ARRAY[1,2] || ARRAY[[3,4],[5,6]]); array_dims ------------ [1:3][1:2] (1 row)
An array can also be constructed by using the functions
array_prepend
, array_append
,
or array_cat
. The first two only support one-dimensional
arrays, but array_cat
supports multidimensional arrays.
Some examples:
SELECT array_prepend(1, ARRAY[2,3]); array_prepend --------------- {1,2,3} (1 row) SELECT array_append(ARRAY[1,2], 3); array_append -------------- {1,2,3} (1 row) SELECT array_cat(ARRAY[1,2], ARRAY[3,4]); array_cat ----------- {1,2,3,4} (1 row) SELECT array_cat(ARRAY[[1,2],[3,4]], ARRAY[5,6]); array_cat --------------------- {{1,2},{3,4},{5,6}} (1 row) SELECT array_cat(ARRAY[5,6], ARRAY[[1,2],[3,4]]); array_cat --------------------- {{5,6},{1,2},{3,4}}
In simple cases, the concatenation operator discussed above is preferred over direct use of these functions. However, because the concatenation operator is overloaded to serve all three cases, there are situations where use of one of the functions is helpful to avoid ambiguity. For example consider:
SELECT ARRAY[1, 2] || '{3, 4}';  -- the untyped literal is taken as an array
 ?column?
-----------
 {1,2,3,4}

SELECT ARRAY[1, 2] || '7';                 -- so is this one
ERROR:  malformed array literal: "7"

SELECT ARRAY[1, 2] || NULL;                -- so is an undecorated NULL
 ?column?
----------
 {1,2}
(1 row)

SELECT array_append(ARRAY[1, 2], NULL);    -- this might have been meant
 array_append
--------------
 {1,2,NULL}
In the examples above, the parser sees an integer array on one side of the
concatenation operator, and a constant of undetermined type on the other.
The heuristic it uses to resolve the constant's type is to assume it's of
the same type as the operator's other input — in this case,
integer array. So the concatenation operator is presumed to
represent array_cat
, not array_append
. When
that's the wrong choice, it could be fixed by casting the constant to the
array's element type; but explicit use of array_append
might
be a preferable solution.
To search for a value in an array, each value must be checked. This can be done manually, if you know the size of the array. For example:
SELECT * FROM sal_emp WHERE pay_by_quarter[1] = 10000 OR pay_by_quarter[2] = 10000 OR pay_by_quarter[3] = 10000 OR pay_by_quarter[4] = 10000;
However, this quickly becomes tedious for large arrays, and is not helpful if the size of the array is unknown. An alternative method is described in Section 9.24. The above query could be replaced by:
SELECT * FROM sal_emp WHERE 10000 = ANY (pay_by_quarter);
In addition, you can find rows where the array has all values equal to 10000 with:
SELECT * FROM sal_emp WHERE 10000 = ALL (pay_by_quarter);
Alternatively, the generate_subscripts
function can be used.
For example:
SELECT * FROM (SELECT pay_by_quarter, generate_subscripts(pay_by_quarter, 1) AS s FROM sal_emp) AS foo WHERE pay_by_quarter[s] = 10000;
This function is described in Table 9.64.
You can also search an array using the &&
operator,
which checks whether the left operand overlaps with the right operand.
For instance:
SELECT * FROM sal_emp WHERE pay_by_quarter && ARRAY[10000];
This and other array operators are further described in Section 9.19. It can be accelerated by an appropriate index, as described in Section 11.2.
You can also search for specific values in an array using the array_position
and array_positions
functions. The former returns the subscript of
the first occurrence of a value in an array; the latter returns an array with the
subscripts of all occurrences of the value in the array. For example:
SELECT array_position(ARRAY['sun','mon','tue','wed','thu','fri','sat'], 'mon'); array_position ---------------- 2 (1 row) SELECT array_positions(ARRAY[1, 4, 3, 1, 3, 4, 2, 1], 1); array_positions ----------------- {1,4,8} (1 row)
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
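A minimal sketch of the normalized alternative (the table and column names here are illustrative, not part of the earlier examples):
CREATE TABLE emp_pay (
    name    text,
    quarter int,
    amount  integer
);
-- the array searches above become ordinary, easily indexable lookups
SELECT DISTINCT name FROM emp_pay WHERE amount = 10000;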
The external text representation of an array value consists of items that
are interpreted according to the I/O conversion rules for the array's
element type, plus decoration that indicates the array structure.
The decoration consists of curly braces ({
and }
)
around the array value plus delimiter characters between adjacent items.
The delimiter character is usually a comma (,
) but can be
something else: it is determined by the typdelim
setting
for the array's element type. Among the standard data types provided
in the PostgreSQL distribution, all use a comma,
except for type box
, which uses a semicolon (;
).
In a multidimensional array, each dimension (row, plane,
cube, etc.) gets its own level of curly braces, and delimiters
must be written between adjacent curly-braced entities of the same level.
The array output routine will put double quotes around element values
if they are empty strings, contain curly braces, delimiter characters,
double quotes, backslashes, or white space, or match the word
NULL
. Double quotes and backslashes
embedded in element values will be backslash-escaped. For numeric
data types it is safe to assume that double quotes will never appear, but
for textual data types one should be prepared to cope with either the presence
or absence of quotes.
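For example (an illustrative query with arbitrary values), only the elements that need it are quoted in the output:
SELECT ARRAY['hello world', 'a,b', 'NULL', '', 'plain'];
-- {"hello world","a,b","NULL","",plain}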
By default, the lower bound index value of an array's dimensions is
set to one. To represent arrays with other lower bounds, the array
subscript ranges can be specified explicitly before writing the
array contents.
This decoration consists of square brackets ([]
)
around each array dimension's lower and upper bounds, with
a colon (:
) delimiter character in between. The
array dimension decoration is followed by an equal sign (=
).
For example:
SELECT f1[1][-2][3] AS e1, f1[1][-1][5] AS e2 FROM (SELECT '[1:1][-2:-1][3:5]={{{1,2,3},{4,5,6}}}'::int[] AS f1) AS ss; e1 | e2 ----+---- 1 | 6 (1 row)
The array output routine will include explicit dimensions in its result only when there are one or more lower bounds different from one.
If the value written for an element is NULL
(in any case
variant), the element is taken to be NULL. The presence of any quotes
or backslashes disables this and allows the literal string value
“NULL” to be entered. Also, for backward compatibility with
pre-8.2 versions of PostgreSQL, the array_nulls configuration parameter can be turned
off
to suppress recognition of NULL
as a NULL.
As shown previously, when writing an array value you can use double
quotes around any individual array element. You must do so
if the element value would otherwise confuse the array-value parser.
For example, elements containing curly braces, commas (or the data type's
delimiter character), double quotes, backslashes, or leading or trailing
whitespace must be double-quoted. Empty strings and strings matching the
word NULL
must be quoted, too. To put a double
quote or backslash in a quoted array element value, precede it
with a backslash. Alternatively, you can avoid quotes and use
backslash-escaping to protect all data characters that would otherwise
be taken as array syntax.
You can add whitespace before a left brace or after a right brace. You can also add whitespace before or after any individual item string. In all of these cases the whitespace will be ignored. However, whitespace within double-quoted elements, or surrounded on both sides by non-whitespace characters of an element, is not ignored.
The ARRAY
constructor syntax (see
Section 4.2.12) is often easier to work
with than the array-literal syntax when writing array values in SQL
commands. In ARRAY
, individual element values are written the
same way they would be written when not members of an array.
A composite type represents the structure of a row or record; it is essentially just a list of field names and their data types. PostgreSQL allows composite types to be used in many of the same ways that simple types can be used. For example, a column of a table can be declared to be of a composite type.
Here are two simple examples of defining composite types:
CREATE TYPE complex AS ( r double precision, i double precision ); CREATE TYPE inventory_item AS ( name text, supplier_id integer, price numeric );
The syntax is comparable to CREATE TABLE
, except that only
field names and types can be specified; no constraints (such as NOT
NULL
) can presently be included. Note that the AS
keyword
is essential; without it, the system will think a different kind
of CREATE TYPE
command is meant, and you will get odd syntax
errors.
Having defined the types, we can use them to create tables:
CREATE TABLE on_hand ( item inventory_item, count integer ); INSERT INTO on_hand VALUES (ROW('fuzzy dice', 42, 1.99), 1000);
or functions:
CREATE FUNCTION price_extension(inventory_item, integer) RETURNS numeric AS 'SELECT $1.price * $2' LANGUAGE SQL; SELECT price_extension(item, 10) FROM on_hand;
Whenever you create a table, a composite type is also automatically created, with the same name as the table, to represent the table's row type. For example, had we said:
CREATE TABLE inventory_item ( name text, supplier_id integer REFERENCES suppliers, price numeric CHECK (price > 0) );
then the same inventory_item
composite type shown above would
come into being as a
byproduct, and could be used just as above. Note however an important
restriction of the current implementation: since no constraints are
associated with a composite type, the constraints shown in the table
definition do not apply to values of the composite type
outside the table. (To work around this, create a domain over the composite
type, and apply the desired constraints as CHECK
constraints of the domain.)
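A minimal sketch of that workaround (the domain and table names here are illustrative), assuming the inventory_item type from above:
CREATE DOMAIN checked_item AS inventory_item
    CHECK ((VALUE).price > 0);
CREATE TABLE on_hand_checked (
    item  checked_item,
    count integer
);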
To write a composite value as a literal constant, enclose the field values within parentheses and separate them by commas. You can put double quotes around any field value, and must do so if it contains commas or parentheses. (More details appear below.) Thus, the general format of a composite constant is the following:
'( val1 , val2 , ... )'
An example is:
'("fuzzy dice",42,1.99)'
which would be a valid value of the inventory_item
type
defined above. To make a field be NULL, write no characters at all
in its position in the list. For example, this constant specifies
a NULL third field:
'("fuzzy dice",42,)'
If you want an empty string rather than NULL, write double quotes:
'("",42,)'
Here the first field is a non-NULL empty string, the third is NULL.
(These constants are actually only a special case of the generic type constants discussed in Section 4.1.2.7. The constant is initially treated as a string and passed to the composite-type input conversion routine. An explicit type specification might be necessary to tell which type to convert the constant to.)
The ROW
expression syntax can also be used to
construct composite values. In most cases this is considerably
simpler to use than the string-literal syntax since you don't have
to worry about multiple layers of quoting. We already used this
method above:
ROW('fuzzy dice', 42, 1.99) ROW('', 42, NULL)
The ROW keyword is actually optional as long as you have more than one field in the expression, so these can be simplified to:
('fuzzy dice', 42, 1.99) ('', 42, NULL)
The ROW
expression syntax is discussed in more detail in Section 4.2.13.
To access a field of a composite column, one writes a dot and the field
name, much like selecting a field from a table name. In fact, it's so
much like selecting from a table name that you often have to use parentheses
to keep from confusing the parser. For example, you might try to select
some subfields from our on_hand
example table with something
like:
SELECT item.name FROM on_hand WHERE item.price > 9.99;
This will not work since the name item
is taken to be a table
name, not a column name of on_hand
, per SQL syntax rules.
You must write it like this:
SELECT (item).name FROM on_hand WHERE (item).price > 9.99;
or if you need to use the table name as well (for instance in a multitable query), like this:
SELECT (on_hand.item).name FROM on_hand WHERE (on_hand.item).price > 9.99;
Now the parenthesized object is correctly interpreted as a reference to
the item
column, and then the subfield can be selected from it.
Similar syntactic issues apply whenever you select a field from a composite value. For instance, to select just one field from the result of a function that returns a composite value, you'd need to write something like:
SELECT (my_func(...)).field FROM ...
Without the extra parentheses, this will generate a syntax error.
The special field name *
means “all fields”, as
further explained in Section 8.16.5.
Here are some examples of the proper syntax for inserting and updating composite columns. First, inserting or updating a whole column:
INSERT INTO mytab (complex_col) VALUES((1.1,2.2)); UPDATE mytab SET complex_col = ROW(1.1,2.2) WHERE ...;
The first example omits ROW
, the second uses it; we
could have done it either way.
We can update an individual subfield of a composite column:
UPDATE mytab SET complex_col.r = (complex_col).r + 1 WHERE ...;
Notice here that we don't need to (and indeed cannot)
put parentheses around the column name appearing just after
SET
, but we do need parentheses when referencing the same
column in the expression to the right of the equal sign.
And we can specify subfields as targets for INSERT
, too:
INSERT INTO mytab (complex_col.r, complex_col.i) VALUES(1.1, 2.2);
Had we not supplied values for all the subfields of the column, the remaining subfields would have been filled with null values.
There are various special syntax rules and behaviors associated with composite types in queries. These rules provide useful shortcuts, but can be confusing if you don't know the logic behind them.
In PostgreSQL, a reference to a table name (or alias)
in a query is effectively a reference to the composite value of the
table's current row. For example, if we had a table
inventory_item
as shown
above, we could write:
SELECT c FROM inventory_item c;
This query produces a single composite-valued column, so we might get output like:
c ------------------------ ("fuzzy dice",42,1.99) (1 row)
Note however that simple names are matched to column names before table
names, so this example works only because there is no column
named c
in the query's tables.
The ordinary qualified-column-name
syntax table_name
.
column_name
can be understood as applying field
selection to the composite value of the table's current row.
(For efficiency reasons, it's not actually implemented that way.)
When we write
SELECT c.* FROM inventory_item c;
then, according to the SQL standard, we should get the contents of the table expanded into separate columns:
name | supplier_id | price ------------+-------------+------- fuzzy dice | 42 | 1.99 (1 row)
as if the query were
SELECT c.name, c.supplier_id, c.price FROM inventory_item c;
PostgreSQL will apply this expansion behavior to
any composite-valued expression, although as shown above, you need to write parentheses
around the value that .*
is applied to whenever it's not a
simple table name. For example, if myfunc()
is a function
returning a composite type with columns a
,
b
, and c
, then these two queries have the
same result:
SELECT (myfunc(x)).* FROM some_table; SELECT (myfunc(x)).a, (myfunc(x)).b, (myfunc(x)).c FROM some_table;
PostgreSQL handles column expansion by
actually transforming the first form into the second. So, in this
example, myfunc()
would get invoked three times per row
with either syntax. If it's an expensive function you may wish to
avoid that, which you can do with a query like:
SELECT m.* FROM some_table, LATERAL myfunc(x) AS m;
Placing the function in
a LATERAL
FROM
item keeps it from
being invoked more than once per row. m.*
is still
expanded into m.a, m.b, m.c
, but now those variables
are just references to the output of the FROM
item.
(The LATERAL
keyword is optional here, but we show it
to clarify that the function is getting x
from some_table
.)
The composite_value
.*
syntax results in
column expansion of this kind when it appears at the top level of
a SELECT
output
list, a RETURNING
list in INSERT
/UPDATE
/DELETE
,
a VALUES
clause, or
a row constructor.
In all other contexts (including when nested inside one of those
constructs), attaching .*
to a composite value does not
change the value, since it means “all columns” and so the
same composite value is produced again. For example,
if somefunc()
accepts a composite-valued argument,
these queries are the same:
SELECT somefunc(c.*) FROM inventory_item c; SELECT somefunc(c) FROM inventory_item c;
In both cases, the current row of inventory_item
is
passed to the function as a single composite-valued argument.
Even though .*
does nothing in such cases, using it is good
style, since it makes clear that a composite value is intended. In
particular, the parser will consider c
in c.*
to
refer to a table name or alias, not to a column name, so that there is
no ambiguity; whereas without .*
, it is not clear
whether c
means a table name or a column name, and in fact
the column-name interpretation will be preferred if there is a column
named c
.
Another example demonstrating these concepts is that all these queries mean the same thing:
SELECT * FROM inventory_item c ORDER BY c; SELECT * FROM inventory_item c ORDER BY c.*; SELECT * FROM inventory_item c ORDER BY ROW(c.*);
All of these ORDER BY
clauses specify the row's composite
value, resulting in sorting the rows according to the rules described
in Section 9.24.6. However,
if inventory_item
contained a column
named c
, the first case would be different from the
others, as it would mean to sort by that column only. Given the column
names previously shown, these queries are also equivalent to those above:
SELECT * FROM inventory_item c ORDER BY ROW(c.name, c.supplier_id, c.price); SELECT * FROM inventory_item c ORDER BY (c.name, c.supplier_id, c.price);
(The last case uses a row constructor with the key word ROW
omitted.)
Another special syntactical behavior associated with composite values is
that we can use functional notation for extracting a field
of a composite value. The simple way to explain this is that
the notations field(table) and table.field
are interchangeable. For example, these queries are equivalent:
SELECT c.name FROM inventory_item c WHERE c.price > 1000; SELECT name(c) FROM inventory_item c WHERE price(c) > 1000;
Moreover, if we have a function that accepts a single argument of a composite type, we can call it with either notation. These queries are all equivalent:
SELECT somefunc(c) FROM inventory_item c; SELECT somefunc(c.*) FROM inventory_item c; SELECT c.somefunc FROM inventory_item c;
This equivalence between functional notation and field notation
makes it possible to use functions on composite types to implement
“computed fields”.
An application using the last query above wouldn't need to be directly
aware that somefunc
isn't a real column of the table.
Because of this behavior, it's unwise to give a function that takes a
single composite-type argument the same name as any of the fields of
that composite type. If there is ambiguity, the field-name
interpretation will be chosen if field-name syntax is used, while the
function will be chosen if function-call syntax is used. However,
PostgreSQL versions before 11 always chose the
field-name interpretation, unless the syntax of the call required it to
be a function call. One way to force the function interpretation in
older versions is to schema-qualify the function name, that is, write
schema.func(compositevalue).
The external text representation of a composite value consists of items that
are interpreted according to the I/O conversion rules for the individual
field types, plus decoration that indicates the composite structure.
The decoration consists of parentheses ((
and )
)
around the whole value, plus commas (,
) between adjacent
items. Whitespace outside the parentheses is ignored, but within the
parentheses it is considered part of the field value, and might or might not be
significant depending on the input conversion rules for the field data type.
For example, in:
'( 42)'
the whitespace will be ignored if the field type is integer, but not if it is text.
As shown previously, when writing a composite value you can write double quotes around any individual field value. You must do so if the field value would otherwise confuse the composite-value parser. In particular, fields containing parentheses, commas, double quotes, or backslashes must be double-quoted. To put a double quote or backslash in a quoted composite field value, precede it with a backslash. (Also, a pair of double quotes within a double-quoted field value is taken to represent a double quote character, analogously to the rules for single quotes in SQL literal strings.) Alternatively, you can avoid quoting and use backslash-escaping to protect all data characters that would otherwise be taken as composite syntax.
A completely empty field value (no characters at all between the commas
or parentheses) represents a NULL. To write a value that is an empty
string rather than NULL, write ""
.
The composite output routine will put double quotes around field values if they are empty strings or contain parentheses, commas, double quotes, backslashes, or white space. (Doing so for white space is not essential, but aids legibility.) Double quotes and backslashes embedded in field values will be doubled.
Remember that what you write in an SQL command will first be interpreted
as a string literal, and then as a composite. This doubles the number of
backslashes you need (assuming escape string syntax is used).
For example, to insert a text
field
containing a double quote and a backslash in a composite
value, you'd need to write:
INSERT ... VALUES ('("\"\\")');
The string-literal processor removes one level of backslashes, so that
what arrives at the composite-value parser looks like
("\"\\")
. In turn, the string
fed to the text
data type's input routine
becomes "\
. (If we were working
with a data type whose input routine also treated backslashes specially,
bytea
for example, we might need as many as eight backslashes
in the command to get one backslash into the stored composite field.)
Dollar quoting (see Section 4.1.2.4) can be
used to avoid the need to double backslashes.
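For example (keeping the elided insert target as written above), the same composite value could be written with dollar quoting, so no backslashes need to be doubled in the SQL command:
INSERT ... VALUES ($$("\"\\")$$);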
The ROW
constructor syntax is usually easier to work with
than the composite-literal syntax when writing composite values in SQL
commands.
In ROW
, individual field values are written the same way
they would be written when not members of a composite.
Range types are data types representing a range of values of some
element type (called the range's subtype).
For instance, ranges
of timestamp
might be used to represent the ranges of
time that a meeting room is reserved. In this case the data type
is tsrange
(short for “timestamp range”),
and timestamp
is the subtype. The subtype must have
a total order so that it is well-defined whether element values are
within, before, or after a range of values.
Range types are useful because they represent many element values in a single range value, and because concepts such as overlapping ranges can be expressed clearly. The use of time and date ranges for scheduling purposes is the clearest example; but price ranges, measurement ranges from an instrument, and so forth can also be useful.
Every range type has a corresponding multirange type. A multirange is an ordered list of non-contiguous, non-empty, non-null ranges. Most range operators also work on multiranges, and they have a few functions of their own.
PostgreSQL comes with the following built-in range types:
int4range
— Range of integer
,
int4multirange
— corresponding Multirange
int8range
— Range of bigint
,
int8multirange
— corresponding Multirange
numrange
— Range of numeric
,
nummultirange
— corresponding Multirange
tsrange
— Range of timestamp without time zone
,
tsmultirange
— corresponding Multirange
tstzrange
— Range of timestamp with time zone
,
tstzmultirange
— corresponding Multirange
daterange
— Range of date
,
datemultirange
— corresponding Multirange
In addition, you can define your own range types; see CREATE TYPE for more information.
CREATE TABLE reservation (room int, during tsrange);
INSERT INTO reservation VALUES
    (1108, '[2010-01-01 14:30, 2010-01-01 15:30)');

-- Containment
SELECT int4range(10, 20) @> 3;

-- Overlaps
SELECT numrange(11.1, 22.2) && numrange(20.0, 30.0);

-- Extract the upper bound
SELECT upper(int8range(15, 25));

-- Compute the intersection
SELECT int4range(10, 20) * int4range(15, 25);

-- Is the range empty?
SELECT isempty(numrange(1, 5));
See Table 9.53 and Table 9.55 for complete lists of operators and functions on range types.
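Multiranges can be used in much the same way; a brief hedged illustration (the values are arbitrary):
SELECT '{[1,3), [5,7)}'::int4multirange @> 6;                          -- containment: true
SELECT '{[1,3), [5,7)}'::int4multirange + '{[3,5)}'::int4multirange;   -- union: {[1,7)}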
Every non-empty range has two bounds, the lower bound and the upper bound. All points between these values are included in the range. An inclusive bound means that the boundary point itself is included in the range as well, while an exclusive bound means that the boundary point is not included in the range.
In the text form of a range, an inclusive lower bound is represented by
“[
” while an exclusive lower bound is
represented by “(
”. Likewise, an inclusive upper bound is represented by
“]
”, while an exclusive upper bound is
represented by “)
”.
(See Section 8.17.5 for more details.)
The functions lower_inc
and upper_inc
test the inclusivity of the lower
and upper bounds of a range value, respectively.
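For example (an illustrative query), the default constructor form is inclusive-lower, exclusive-upper:
SELECT lower_inc(numrange(1.0, 5.0)), upper_inc(numrange(1.0, 5.0));
-- t | f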
The lower bound of a range can be omitted, meaning that all
values less than the upper bound are included in the range, e.g.,
(,3]
. Likewise, if the upper bound of the range
is omitted, then all values greater than the lower bound are included
in the range. If both lower and upper bounds are omitted, all values
of the element type are considered to be in the range. Specifying a
missing bound as inclusive is automatically converted to exclusive,
e.g., [,]
is converted to (,)
.
You can think of these missing values as +/-infinity, but they are
special range type values and are considered to be beyond any range
element type's +/-infinity values.
Element types that have the notion of “infinity” can
use them as explicit bound values. For example, with timestamp
ranges, [today,infinity)
excludes the special
timestamp
value infinity
,
while [today,infinity]
includes it, as do
[today,)
and [today,]
.
The functions lower_inf
and upper_inf
test for infinite lower
and upper bounds of a range, respectively.
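For example (an illustrative query with an omitted lower bound):
SELECT lower_inf(numrange(NULL, 3)), upper_inf(numrange(NULL, 3));
-- t | f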
The input for a range value must follow one of the following patterns:
(lower-bound,upper-bound)
(lower-bound,upper-bound]
[lower-bound,upper-bound)
[lower-bound,upper-bound]
empty
The parentheses or brackets indicate whether the lower and upper bounds
are exclusive or inclusive, as described previously.
Notice that the final pattern is empty
, which
represents an empty range (a range that contains no points).
The lower-bound
may be either a string
that is valid input for the subtype, or empty to indicate no
lower bound. Likewise, upper-bound
may be
either a string that is valid input for the subtype, or empty to
indicate no upper bound.
Each bound value can be quoted using "
(double quote)
characters. This is necessary if the bound value contains parentheses,
brackets, commas, double quotes, or backslashes, since these characters
would otherwise be taken as part of the range syntax. To put a double
quote or backslash in a quoted bound value, precede it with a
backslash. (Also, a pair of double quotes within a double-quoted bound
value is taken to represent a double quote character, analogously to the
rules for single quotes in SQL literal strings.) Alternatively, you can
avoid quoting and use backslash-escaping to protect all data characters
that would otherwise be taken as range syntax. Also, to write a bound
value that is an empty string, write ""
, since writing
nothing means an infinite bound.
Whitespace is allowed before and after the range value, but any whitespace between the parentheses or brackets is taken as part of the lower or upper bound value. (Depending on the element type, it might or might not be significant.)
These rules are very similar to those for writing field values in composite-type literals. See Section 8.16.6 for additional commentary.
Examples:
-- includes 3, does not include 7, and does include all points in between
SELECT '[3,7)'::int4range;

-- does not include either 3 or 7, but includes all points in between
SELECT '(3,7)'::int4range;

-- includes only the single point 4
SELECT '[4,4]'::int4range;

-- includes no points (and will be normalized to 'empty')
SELECT '[4,4)'::int4range;
The input for a multirange is curly brackets ({
and
}
) containing zero or more valid ranges,
separated by commas. Whitespace is permitted around the brackets and
commas. This is intended to be reminiscent of array syntax, although
multiranges are much simpler: they have just one dimension and there is
no need to quote their contents. (The bounds of their ranges may be
quoted as above however.)
Examples:
SELECT '{}'::int4multirange; SELECT '{[3,7)}'::int4multirange; SELECT '{[3,7), [8,9)}'::int4multirange;
Each range type has a constructor function with the same name as the range
type. Using the constructor function is frequently more convenient than
writing a range literal constant, since it avoids the need for extra
quoting of the bound values. The constructor function
accepts two or three arguments. The two-argument form constructs a range
in standard form (lower bound inclusive, upper bound exclusive), while
the three-argument form constructs a range with bounds of the form
specified by the third argument.
The third argument must be one of the strings
“()
”,
“(]
”,
“[)
”, or
“[]
”.
For example:
-- The full form is: lower bound, upper bound, and text argument indicating
-- inclusivity/exclusivity of bounds.
SELECT numrange(1.0, 14.0, '(]');

-- If the third argument is omitted, '[)' is assumed.
SELECT numrange(1.0, 14.0);

-- Although '(]' is specified here, on display the value will be converted to
-- canonical form, since int8range is a discrete range type (see below).
SELECT int8range(1, 14, '(]');

-- Using NULL for either bound causes the range to be unbounded on that side.
SELECT numrange(NULL, 2.2);
Each range type also has a multirange constructor with the same name as the multirange type. The constructor function takes zero or more arguments which are all ranges of the appropriate type. For example:
SELECT nummultirange(); SELECT nummultirange(numrange(1.0, 14.0)); SELECT nummultirange(numrange(1.0, 14.0), numrange(20.0, 25.0));
A discrete range is one whose element type has a well-defined
“step”, such as integer
or date
.
In these types two elements can be said to be adjacent, when there are
no valid values between them. This contrasts with continuous ranges,
where it's always (or almost always) possible to identify other element
values between two given values. For example, a range over the
numeric
type is continuous, as is a range over timestamp
.
(Even though timestamp
has limited precision, and so could
theoretically be treated as discrete, it's better to consider it continuous
since the step size is normally not of interest.)
Another way to think about a discrete range type is that there is a clear
idea of a “next” or “previous” value for each element value.
Knowing that, it is possible to convert between inclusive and exclusive
representations of a range's bounds, by choosing the next or previous
element value instead of the one originally given.
For example, in an integer range type [4,8]
and
(3,9)
denote the same set of values; but this would not be so
for a range over numeric.
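An illustrative comparison (the values are arbitrary):
SELECT '[4,8]'::int4range = '(3,9)'::int4range;  -- true: both canonicalize to [4,9)
SELECT '[4,8]'::numrange  = '(3,9)'::numrange;   -- false: numeric ranges are continuous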
A discrete range type should have a canonicalization function that is aware of the desired step size for the element type. The canonicalization function is charged with converting equivalent values of the range type to have identical representations, in particular consistently inclusive or exclusive bounds. If a canonicalization function is not specified, then ranges with different formatting will always be treated as unequal, even though they might represent the same set of values in reality.
The built-in range types int4range
, int8range
,
and daterange
all use a canonical form that includes
the lower bound and excludes the upper bound; that is,
[)
. User-defined range types can use other conventions,
however.
Users can define their own range types. The most common reason to do
this is to use ranges over subtypes not provided among the built-in
range types.
For example, to define a new range type of subtype float8
:
CREATE TYPE floatrange AS RANGE ( subtype = float8, subtype_diff = float8mi ); SELECT '[1.234, 5.678]'::floatrange;
Because float8
has no meaningful
“step”, we do not define a canonicalization
function in this example.
When you define your own range you automatically get a corresponding multirange type.
Defining your own range type also allows you to specify a different subtype B-tree operator class or collation to use, so as to change the sort ordering that determines which values fall into a given range.
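A hedged sketch of such a definition (the type name is illustrative), using the collation option of CREATE TYPE ... AS RANGE:
CREATE TYPE textrange_c AS RANGE (
    subtype = text,
    collation = "C"
);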
If the subtype is considered to have discrete rather than continuous
values, the CREATE TYPE
command should specify a
canonical
function.
The canonicalization function takes an input range value, and must return
an equivalent range value that may have different bounds and formatting.
The canonical output for two ranges that represent the same set of values,
for example the integer ranges [1, 7]
and [1,
8)
, must be identical. It doesn't matter which representation
you choose to be the canonical one, so long as two equivalent values with
different formattings are always mapped to the same value with the same
formatting. In addition to adjusting the inclusive/exclusive bounds
format, a canonicalization function might round off boundary values, in
case the desired step size is larger than what the subtype is capable of
storing. For instance, a range type over timestamp
could be
defined to have a step size of an hour, in which case the canonicalization
function would need to round off bounds that weren't a multiple of an hour,
or perhaps throw an error instead.
In addition, any range type that is meant to be used with GiST or SP-GiST
indexes should define a subtype difference, or subtype_diff
,
function. (The index will still work without subtype_diff
,
but it is likely to be considerably less efficient than if a difference
function is provided.) The subtype difference function takes two input
values of the subtype, and returns their difference
(i.e., X
minus Y
) represented as
a float8
value. In our example above, the
function float8mi
that underlies the regular float8
minus operator can be used; but for any other subtype, some type
conversion would be necessary. Some creative thought about how to
represent differences as numbers might be needed, too. To the greatest
extent possible, the subtype_diff
function should agree with
the sort ordering implied by the selected operator class and collation;
that is, its result should be positive whenever its first argument is
greater than its second according to the sort ordering.
A less-oversimplified example of a subtype_diff
function is:
CREATE FUNCTION time_subtype_diff(x time, y time) RETURNS float8 AS 'SELECT EXTRACT(EPOCH FROM (x - y))' LANGUAGE sql STRICT IMMUTABLE; CREATE TYPE timerange AS RANGE ( subtype = time, subtype_diff = time_subtype_diff ); SELECT '[11:10, 23:00]'::timerange;
See CREATE TYPE for more information about creating range types.
GiST and SP-GiST indexes can be created for table columns of range types. GiST indexes can be also created for table columns of multirange types. For instance, to create a GiST index:
CREATE INDEX reservation_idx ON reservation USING GIST (during);
A GiST or SP-GiST index on ranges can accelerate queries involving these
range operators:
=
,
&&
,
<@
,
@>
,
<<
,
>>
,
-|-
,
&<
, and
&>
.
A GiST index on multiranges can accelerate queries involving the same
set of multirange operators.
A GiST index on ranges and GiST index on multiranges can also accelerate
queries involving these cross-type range to multirange and multirange to
range operators correspondingly:
&&
,
<@
,
@>
,
<<
,
>>
,
-|-
,
&<
, and
&>
.
See Table 9.53 for more information.
In addition, B-tree and hash indexes can be created for table columns of
range types. For these index types, basically the only useful range
operation is equality. There is a B-tree sort ordering defined for range
values, with corresponding <
and >
operators,
but the ordering is rather arbitrary and not usually useful in the real
world. Range types' B-tree and hash support is primarily meant to
allow sorting and hashing internally in queries, rather than creation of
actual indexes.
While UNIQUE
is a natural constraint for scalar
values, it is usually unsuitable for range types. Instead, an
exclusion constraint is often more appropriate
(see CREATE TABLE
... CONSTRAINT ... EXCLUDE). Exclusion constraints allow the
specification of constraints such as “non-overlapping” on a
range type. For example:
CREATE TABLE reservation ( during tsrange, EXCLUDE USING GIST (during WITH &&) );
That constraint will prevent any overlapping values from existing in the table at the same time:
INSERT INTO reservation VALUES ('[2010-01-01 11:30, 2010-01-01 15:00)'); INSERT 0 1 INSERT INTO reservation VALUES ('[2010-01-01 14:45, 2010-01-01 15:45)'); ERROR: conflicting key value violates exclusion constraint "reservation_during_excl" DETAIL: Key (during)=(["2010-01-01 14:45:00","2010-01-01 15:45:00")) conflicts with existing key (during)=(["2010-01-01 11:30:00","2010-01-01 15:00:00")).
You can use the btree_gist
extension to define exclusion constraints on plain scalar data types, which
can then be combined with range exclusions for maximum flexibility. For
example, after btree_gist
is installed, the following
constraint will reject overlapping ranges only if the meeting room numbers
are equal:
CREATE EXTENSION btree_gist; CREATE TABLE room_reservation ( room text, during tsrange, EXCLUDE USING GIST (room WITH =, during WITH &&) ); INSERT INTO room_reservation VALUES ('123A', '[2010-01-01 14:00, 2010-01-01 15:00)'); INSERT 0 1 INSERT INTO room_reservation VALUES ('123A', '[2010-01-01 14:30, 2010-01-01 15:30)'); ERROR: conflicting key value violates exclusion constraint "room_reservation_room_during_excl" DETAIL: Key (room, during)=(123A, ["2010-01-01 14:30:00","2010-01-01 15:30:00")) conflicts with existing key (room, during)=(123A, ["2010-01-01 14:00:00","2010-01-01 15:00:00")). INSERT INTO room_reservation VALUES ('123B', '[2010-01-01 14:30, 2010-01-01 15:30)'); INSERT 0 1
A domain is a user-defined data type that is based on another underlying type. Optionally, it can have constraints that restrict its valid values to a subset of what the underlying type would allow. Otherwise it behaves like the underlying type — for example, any operator or function that can be applied to the underlying type will work on the domain type. The underlying type can be any built-in or user-defined base type, enum type, array type, composite type, range type, or another domain.
For example, we could create a domain over integers that accepts only positive integers:
CREATE DOMAIN posint AS integer CHECK (VALUE > 0);
CREATE TABLE mytable (id posint);
INSERT INTO mytable VALUES(1);   -- works
INSERT INTO mytable VALUES(-1);  -- fails
When an operator or function of the underlying type is applied to a
domain value, the domain is automatically down-cast to the underlying
type. Thus, for example, the result of mytable.id - 1
is considered to be of type integer
not posint
.
We could write (mytable.id - 1)::posint
to cast the
result back to posint
, causing the domain's constraints
to be rechecked. In this case, that would result in an error if the
expression had been applied to an id
value of
1. Assigning a value of the underlying type to a field or variable of
the domain type is allowed without writing an explicit cast, but the
domain's constraints will be checked.
For additional information see CREATE DOMAIN.
Object identifiers (OIDs) are used internally by
PostgreSQL as primary keys for various
system tables.
Type oid
represents an object identifier. There are also
several alias types for oid
, each
named regsomething.
Table 8.26 shows an
overview.
The oid
type is currently implemented as an unsigned
four-byte integer. Therefore, it is not large enough to provide
database-wide uniqueness in large databases, or even in large
individual tables.
The oid
type itself has few operations beyond comparison.
It can be cast to integer, however, and then manipulated using the
standard integer operators. (Beware of possible
signed-versus-unsigned confusion if you do this.)
The OID alias types have no operations of their own except
for specialized input and output routines. These routines are able
to accept and display symbolic names for system objects, rather than
the raw numeric value that type oid
would use. The alias
types allow simplified lookup of OID values for objects. For example,
to examine the pg_attribute
rows related to a table
mytable
, one could write:
SELECT * FROM pg_attribute WHERE attrelid = 'mytable'::regclass;
rather than:
SELECT * FROM pg_attribute WHERE attrelid = (SELECT oid FROM pg_class WHERE relname = 'mytable');
While that doesn't look all that bad by itself, it's still oversimplified.
A far more complicated sub-select would be needed to
select the right OID if there are multiple tables named
mytable
in different schemas.
The regclass
input converter handles the table lookup according
to the schema path setting, and so it does the “right thing”
automatically. Similarly, casting a table's OID to
regclass
is handy for symbolic display of a numeric OID.
Table 8.26. Object Identifier Types
Name | References | Description | Value Example |
---|---|---|---|
oid | any | numeric object identifier | 564182 |
regclass | pg_class | relation name | pg_type |
regcollation | pg_collation | collation name | "POSIX" |
regconfig | pg_ts_config | text search configuration | english |
regdictionary | pg_ts_dict | text search dictionary | simple |
regnamespace | pg_namespace | namespace name | pg_catalog |
regoper | pg_operator | operator name | + |
regoperator | pg_operator | operator with argument types | *(integer,integer) or -(NONE,integer) |
regproc | pg_proc | function name | sum |
regprocedure | pg_proc | function with argument types | sum(int4) |
regrole | pg_authid | role name | smithee |
regtype | pg_type | data type name | integer |
All of the OID alias types for objects that are grouped by namespace
accept schema-qualified names, and will
display schema-qualified names on output if the object would not
be found in the current search path without being qualified.
For example, myschema.mytable
is acceptable input
for regclass
(if there is such a table). That value
might be output as myschema.mytable
, or
just mytable
, depending on the current search path.
The regproc
and regoper
alias types will only
accept input names that are unique (not overloaded), so they are
of limited use; for most uses regprocedure
or
regoperator
are more appropriate. For regoperator
,
unary operators are identified by writing NONE
for the unused
operand.
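For example (illustrative casts using the sample values from Table 8.26):
SELECT 'sum(int4)'::regprocedure;       -- displayed as sum(integer)
SELECT '-(NONE,integer)'::regoperator;  -- unary minus on integer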
The input functions for these types allow whitespace between tokens,
and will fold upper-case letters to lower case, except within double
quotes; this is done to make the syntax rules similar to the way
object names are written in SQL. Conversely, the output functions
will use double quotes if needed to make the output be a valid SQL
identifier. For example, the OID of a function
named Foo
(with upper case F
)
taking two integer arguments could be entered as
' "Foo" ( int, integer ) '::regprocedure
. The
output would look like "Foo"(integer,integer)
.
Both the function name and the argument type names could be
schema-qualified, too.
Many built-in PostgreSQL functions accept
the OID of a table, or another kind of database object, and for
convenience are declared as taking regclass
(or the
appropriate OID alias type). This means you do not have to look up
the object's OID by hand, but can just enter its name as a string
literal. For example, the nextval(regclass)
function
takes a sequence relation's OID, so you could call it like this:
nextval('foo')              operates on sequence foo
nextval('FOO')              same as above
nextval('"Foo"')            operates on sequence Foo
nextval('myschema.foo')     operates on myschema.foo
nextval('"myschema".foo')   same as above
nextval('foo')              searches search path for foo
When you write the argument of such a function as an unadorned literal string, it becomes a constant of type regclass (or the appropriate type). Since this is really just an OID, it will track the originally identified object despite later renaming, schema reassignment, etc. This “early binding” behavior is usually desirable for object references in column defaults and views. But sometimes you might want “late binding” where the object reference is resolved at run time. To get late-binding behavior, force the constant to be stored as a text constant instead of regclass:

nextval('foo'::text)        foo is looked up at runtime

The to_regclass() function and its siblings can also be used to perform run-time lookups. See Table 9.70.
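For instance, a minimal sketch of such a run-time lookup (the name shown is hypothetical; to_regclass returns NULL rather than raising an error when the name cannot be resolved):

SELECT to_regclass('myschema.mytable');   -- the table's OID, or NULL if it does not exist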
Another practical example of use of regclass is to look up the OID of a table listed in the information_schema views, which don't supply such OIDs directly. One might for example wish to call the pg_relation_size() function, which requires the table OID. Taking the above rules into account, the correct way to do that is:

SELECT table_schema, table_name, pg_relation_size((quote_ident(table_schema) || '.' || quote_ident(table_name))::regclass) FROM information_schema.tables WHERE ...

The quote_ident() function will take care of double-quoting the identifiers where needed. The seemingly easier

SELECT pg_relation_size(table_name) FROM information_schema.tables WHERE ...

is not recommended, because it will fail for tables that are outside your search path or have names that require quoting.
An additional property of most of the OID alias types is the creation of dependencies. If a constant of one of these types appears in a stored expression (such as a column default expression or view), it creates a dependency on the referenced object. For example, if a column has a default expression nextval('my_seq'::regclass), PostgreSQL understands that the default expression depends on the sequence my_seq, so the system will not let the sequence be dropped without first removing the default expression. The alternative of nextval('my_seq'::text) does not create a dependency. (regrole is an exception to this property. Constants of this type are not allowed in stored expressions.)
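A minimal sketch of this dependency behavior, using illustrative object names:

CREATE SEQUENCE my_seq;
CREATE TABLE my_tab (id bigint DEFAULT nextval('my_seq'::regclass));

DROP SEQUENCE my_seq;   -- fails: the column default depends on the sequence

ALTER TABLE my_tab ALTER COLUMN id DROP DEFAULT;
DROP SEQUENCE my_seq;   -- succeeds once the dependent default is removed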
Another identifier type used by the system is xid, or transaction (abbreviated xact) identifier. This is the data type of the system columns xmin and xmax. Transaction identifiers are 32-bit quantities. In some contexts, a 64-bit variant xid8 is used. Unlike xid values, xid8 values increase strictly monotonically and cannot be reused in the lifetime of a database cluster.

A third identifier type used by the system is cid, or command identifier. This is the data type of the system columns cmin and cmax. Command identifiers are also 32-bit quantities.

A final identifier type used by the system is tid, or tuple identifier (row identifier). This is the data type of the system column ctid. A tuple ID is a pair (block number, tuple index within block) that identifies the physical location of the row within its table.

(The system columns are further explained in Section 5.5.)
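A minimal sketch of inspecting these identifiers (mytable is a hypothetical table containing at least one row):

-- tuple, transaction, and command identifiers of one stored row
SELECT ctid, xmin, xmax, cmin, cmax FROM mytable LIMIT 1;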
pg_lsn Type

The pg_lsn data type can be used to store LSN (Log Sequence Number) data which is a pointer to a location in the WAL. This type is a representation of XLogRecPtr and an internal system type of PostgreSQL.

Internally, an LSN is a 64-bit integer, representing a byte position in the write-ahead log stream. It is printed as two hexadecimal numbers of up to 8 digits each, separated by a slash; for example, 16/B374D848. The pg_lsn type supports the standard comparison operators, like = and >. Two LSNs can be subtracted using the - operator; the result is the number of bytes separating those write-ahead log locations. Also the number of bytes can be added into and subtracted from LSN using the +(pg_lsn,numeric) and -(pg_lsn,numeric) operators, respectively. Note that the calculated LSN should be in the range of pg_lsn type, i.e., between 0/0 and FFFFFFFF/FFFFFFFF.
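The comparison and arithmetic described above can be tried with literal LSNs; pg_current_wal_lsn() additionally reports the server's current WAL location (its result depends on the server's state):

SELECT '16/B374D848'::pg_lsn - '16/B374D800'::pg_lsn;   -- 72 (bytes between the two locations)
SELECT '16/B374D848'::pg_lsn + 16::numeric;             -- 16/B374D858
SELECT pg_current_wal_lsn();                            -- server-dependent result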
The PostgreSQL type system contains a number of special-purpose entries that are collectively called pseudo-types. A pseudo-type cannot be used as a column data type, but it can be used to declare a function's argument or result type. Each of the available pseudo-types is useful in situations where a function's behavior does not correspond to simply taking or returning a value of a specific SQL data type. Table 8.27 lists the existing pseudo-types.
Table 8.27. Pseudo-Types
Name | Description |
---|---|
any | Indicates that a function accepts any input data type. |
anyelement | Indicates that a function accepts any data type (see Section 38.2.5). |
anyarray | Indicates that a function accepts any array data type (see Section 38.2.5). |
anynonarray | Indicates that a function accepts any non-array data type (see Section 38.2.5). |
anyenum | Indicates that a function accepts any enum data type (see Section 38.2.5 and Section 8.7). |
anyrange | Indicates that a function accepts any range data type (see Section 38.2.5 and Section 8.17). |
anymultirange | Indicates that a function accepts any multirange data type (see Section 38.2.5 and Section 8.17). |
anycompatible | Indicates that a function accepts any data type, with automatic promotion of multiple arguments to a common data type (see Section 38.2.5). |
anycompatiblearray | Indicates that a function accepts any array data type, with automatic promotion of multiple arguments to a common data type (see Section 38.2.5). |
anycompatiblenonarray | Indicates that a function accepts any non-array data type, with automatic promotion of multiple arguments to a common data type (see Section 38.2.5). |
anycompatiblerange | Indicates that a function accepts any range data type, with automatic promotion of multiple arguments to a common data type (see Section 38.2.5 and Section 8.17). |
anycompatiblemultirange | Indicates that a function accepts any multirange data type, with automatic promotion of multiple arguments to a common data type (see Section 38.2.5 and Section 8.17). |
cstring | Indicates that a function accepts or returns a null-terminated C string. |
internal | Indicates that a function accepts or returns a server-internal data type. |
language_handler | A procedural language call handler is declared to return language_handler . |
fdw_handler | A foreign-data wrapper handler is declared to return fdw_handler . |
table_am_handler | A table access method handler is declared to return table_am_handler . |
index_am_handler | An index access method handler is declared to return index_am_handler . |
tsm_handler | A tablesample method handler is declared to return tsm_handler . |
record | Identifies a function taking or returning an unspecified row type. |
trigger | A trigger function is declared to return trigger. |
event_trigger | An event trigger function is declared to return event_trigger. |
pg_ddl_command | Identifies a representation of DDL commands that is available to event triggers. |
void | Indicates that a function returns no value. |
unknown | Identifies a not-yet-resolved type, e.g., of an undecorated string literal. |
Functions coded in C (whether built-in or dynamically loaded) can be declared to accept or return any of these pseudo-types. It is up to the function author to ensure that the function will behave safely when a pseudo-type is used as an argument type.
Functions coded in procedural languages can use pseudo-types only as allowed by their implementation languages. At present most procedural languages forbid use of a pseudo-type as an argument type, and allow only void and record as a result type (plus trigger or event_trigger when the function is used as a trigger or event trigger). Some also support polymorphic functions using the polymorphic pseudo-types, which are shown above and discussed in detail in Section 38.2.5.
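As a minimal sketch of a polymorphic function declared with one of these pseudo-types (written here in the SQL language; the function name is illustrative):

-- returns the first argument, or the second if the first is null;
-- both arguments and the result resolve to one concrete type per call
CREATE FUNCTION coalesce_to(val anyelement, fallback anyelement)
RETURNS anyelement
LANGUAGE SQL
AS $$ SELECT coalesce(val, fallback) $$;

SELECT coalesce_to(NULL::integer, 0);     -- 0
SELECT coalesce_to('abc'::text, 'n/a');   -- abc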
The internal pseudo-type is used to declare functions that are meant only to be called internally by the database system, and not by direct invocation in an SQL query. If a function has at least one internal-type argument then it cannot be called from SQL. To preserve the type safety of this restriction it is important to follow this coding rule: do not create any function that is declared to return internal unless it has at least one internal argument.
[7] For this purpose, the term “value” includes array elements, though JSON terminology sometimes considers array elements distinct from values within objects.
PostgreSQL provides a large number of functions and operators for the built-in data types. This chapter describes most of them, although additional special-purpose functions appear in relevant sections of the manual. Users can also define their own functions and operators, as described in Part V. The psql commands \df and \do can be used to list all available functions and operators, respectively.

The notation used throughout this chapter to describe the argument and result data types of a function or operator is like this:

repeat ( text, integer ) → text

which says that the function repeat takes one text and one integer argument and returns a result of type text. The right arrow is also used to indicate the result of an example, thus:

repeat('Pg', 4) → PgPgPgPg
If you are concerned about portability then note that most of the functions and operators described in this chapter, with the exception of the most trivial arithmetic and comparison operators and some explicitly marked functions, are not specified by the SQL standard. Some of this extended functionality is present in other SQL database management systems, and in many cases this functionality is compatible and consistent between the various implementations.
The usual logical operators are available:

boolean AND boolean → boolean
boolean OR boolean → boolean
NOT boolean → boolean

SQL uses a three-valued logic system with true, false, and null, which represents “unknown”. Observe the following truth tables:
a | b | a AND b | a OR b |
---|---|---|---|
TRUE | TRUE | TRUE | TRUE |
TRUE | FALSE | FALSE | TRUE |
TRUE | NULL | NULL | TRUE |
FALSE | FALSE | FALSE | FALSE |
FALSE | NULL | FALSE | NULL |
NULL | NULL | NULL | NULL |
a | NOT a |
---|---|
TRUE | FALSE |
FALSE | TRUE |
NULL | NULL |
The operators AND and OR are commutative, that is, you can switch the left and right operands without affecting the result. (However, it is not guaranteed that the left operand is evaluated before the right operand. See Section 4.2.14 for more information about the order of evaluation of subexpressions.)
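The truth tables above can be verified directly; a short illustration:

SELECT true AND NULL;      -- null (unknown)
SELECT false AND NULL;     -- false
SELECT true OR NULL;       -- true
SELECT NOT NULL::boolean;  -- null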
The usual comparison operators are available, as shown in Table 9.1.
Table 9.1. Comparison Operators
Operator | Description |
---|---|
datatype < datatype → boolean | Less than |
datatype > datatype → boolean | Greater than |
datatype <= datatype → boolean | Less than or equal to |
datatype >= datatype → boolean | Greater than or equal to |
datatype = datatype → boolean | Equal |
datatype <> datatype → boolean | Not equal |
datatype != datatype → boolean | Not equal |
<> is the standard SQL notation for “not equal”. != is an alias, which is converted to <> at a very early stage of parsing. Hence, it is not possible to implement != and <> operators that do different things.
These comparison operators are available for all built-in data types that have a natural ordering, including numeric, string, and date/time types. In addition, arrays, composite types, and ranges can be compared if their component data types are comparable.
It is usually possible to compare values of related data types as well; for example integer > bigint will work. Some cases of this sort are implemented directly by “cross-type” comparison operators, but if no such operator is available, the parser will coerce the less-general type to the more-general type and apply the latter's comparison operator.

As shown above, all comparison operators are binary operators that return values of type boolean. Thus, expressions like 1 < 2 < 3 are not valid (because there is no < operator to compare a Boolean value with 3). Use the BETWEEN predicates shown below to perform range tests.
There are also some comparison predicates, as shown in Table 9.2. These behave much like operators, but have special syntax mandated by the SQL standard.
Table 9.2. Comparison Predicates
Predicate | Description |
---|---|
datatype BETWEEN datatype AND datatype | Between (inclusive of the range endpoints). |
datatype NOT BETWEEN datatype AND datatype | Not between (the negation of BETWEEN). |
datatype BETWEEN SYMMETRIC datatype AND datatype | Between, after sorting the two endpoint values. |
datatype NOT BETWEEN SYMMETRIC datatype AND datatype | Not between, after sorting the two endpoint values. |
datatype IS DISTINCT FROM datatype | Not equal, treating null as a comparable value. |
datatype IS NOT DISTINCT FROM datatype | Equal, treating null as a comparable value. |
datatype IS NULL | Test whether value is null. |
datatype IS NOT NULL | Test whether value is not null. |
datatype ISNULL | Test whether value is null (nonstandard syntax). |
datatype NOTNULL | Test whether value is not null (nonstandard syntax). |
boolean IS TRUE | Test whether boolean expression yields true. |
boolean IS NOT TRUE | Test whether boolean expression yields false or unknown. |
boolean IS FALSE | Test whether boolean expression yields false. |
boolean IS NOT FALSE | Test whether boolean expression yields true or unknown. |
boolean IS UNKNOWN | Test whether boolean expression yields unknown. |
boolean IS NOT UNKNOWN | Test whether boolean expression yields true or false. |
The BETWEEN predicate simplifies range tests:

a BETWEEN x AND y

is equivalent to

a >= x AND a <= y

Notice that BETWEEN treats the endpoint values as included in the range. BETWEEN SYMMETRIC is like BETWEEN except there is no requirement that the argument to the left of AND be less than or equal to the argument on the right. If it is not, those two arguments are automatically swapped, so that a nonempty range is always implied.
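A few illustrative queries (not taken from the standard examples):

SELECT 2 BETWEEN 1 AND 3;             -- true
SELECT 2 BETWEEN 3 AND 1;             -- false (3 is not <= 1, so the range is empty)
SELECT 2 BETWEEN SYMMETRIC 3 AND 1;   -- true (the endpoints are swapped first)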
The various variants of BETWEEN are implemented in terms of the ordinary comparison operators, and therefore will work for any data type(s) that can be compared.

The use of AND in the BETWEEN syntax creates an ambiguity with the use of AND as a logical operator. To resolve this, only a limited set of expression types are allowed as the second argument of a BETWEEN clause. If you need to write a more complex sub-expression in BETWEEN, write parentheses around the sub-expression.
Ordinary comparison operators yield null (signifying “unknown”), not true or false, when either input is null. For example, 7 = NULL yields null, as does 7 <> NULL. When this behavior is not suitable, use the IS [ NOT ] DISTINCT FROM predicates:

a IS DISTINCT FROM b
a IS NOT DISTINCT FROM b

For non-null inputs, IS DISTINCT FROM is the same as the <> operator. However, if both inputs are null it returns false, and if only one input is null it returns true. Similarly, IS NOT DISTINCT FROM is identical to = for non-null inputs, but it returns true when both inputs are null, and false when only one input is null. Thus, these predicates effectively act as though null were a normal data value, rather than “unknown”.
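A few illustrative cases:

SELECT 1 <> NULL;                       -- null
SELECT 1 IS DISTINCT FROM NULL;         -- true
SELECT NULL IS DISTINCT FROM NULL;      -- false
SELECT NULL IS NOT DISTINCT FROM NULL;  -- true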
To check whether a value is or is not null, use the predicates:

expression IS NULL
expression IS NOT NULL

or the equivalent, but nonstandard, predicates:

expression ISNULL
expression NOTNULL

Do not write expression = NULL, because NULL is not “equal to” NULL. (The null value represents an unknown value, and it is not known whether two unknown values are equal.)

Some applications might expect that expression = NULL returns true if expression evaluates to the null value. It is highly recommended that these applications be modified to comply with the SQL standard. However, if that cannot be done the transform_null_equals configuration variable is available. If it is enabled, PostgreSQL will convert x = NULL clauses to x IS NULL.
If the expression is row-valued, then IS NULL is true when the row expression itself is null or when all the row's fields are null, while IS NOT NULL is true when the row expression itself is non-null and all the row's fields are non-null. Because of this behavior, IS NULL and IS NOT NULL do not always return inverse results for row-valued expressions; in particular, a row-valued expression that contains both null and non-null fields will return false for both tests. In some cases, it may be preferable to write row IS DISTINCT FROM NULL or row IS NOT DISTINCT FROM NULL, which will simply check whether the overall row value is null without any additional tests on the row fields.
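An illustrative comparison of the two behaviors:

SELECT ROW(1, NULL) IS NULL;                -- false (not all fields are null)
SELECT ROW(1, NULL) IS NOT NULL;            -- false (not all fields are non-null)
SELECT ROW(1, NULL) IS DISTINCT FROM NULL;  -- true (the row value itself is not null)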
Boolean values can also be tested using the predicates

boolean_expression IS TRUE
boolean_expression IS NOT TRUE
boolean_expression IS FALSE
boolean_expression IS NOT FALSE
boolean_expression IS UNKNOWN
boolean_expression IS NOT UNKNOWN

These will always return true or false, never a null value, even when the operand is null. A null input is treated as the logical value “unknown”. Notice that IS UNKNOWN and IS NOT UNKNOWN are effectively the same as IS NULL and IS NOT NULL, respectively, except that the input expression must be of Boolean type.
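A short illustration:

SELECT (NULL::boolean) IS UNKNOWN;   -- true
SELECT true IS UNKNOWN;              -- false
SELECT (NULL::boolean) IS NOT TRUE;  -- true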
Some comparison-related functions are also available, as shown in Table 9.3.
Table 9.3. Comparison Functions
Mathematical operators are provided for many PostgreSQL types. For types without standard mathematical conventions (e.g., date/time types) we describe the actual behavior in subsequent sections.
Table 9.4 shows the mathematical operators that are available for the standard numeric types. Unless otherwise noted, operators shown as accepting numeric_type are available for all the types smallint, integer, bigint, numeric, real, and double precision. Operators shown as accepting integral_type are available for the types smallint, integer, and bigint. Except where noted, each form of an operator returns the same data type as its argument(s). Calls involving multiple argument data types, such as integer + numeric, are resolved by using the type appearing later in these lists.
Table 9.4. Mathematical Operators
Operator Description Example(s) |
---|
Addition
|
Unary plus (no operation)
|
Subtraction
|
Negation
|
Multiplication
|
Division (for integral types, division truncates the result towards zero)
|
Modulo (remainder); available for smallint, integer, bigint, and numeric
|
Exponentiation; unlike typical mathematical practice, multiple uses of the exponentiation operator associate left to right
|
Square root
|
Cube root
|
Absolute value
|
Bitwise AND
|
Bitwise OR
|
Bitwise exclusive OR
|
Bitwise NOT
|
Bitwise shift left
|
Bitwise shift right
|
Table 9.5 shows the available mathematical functions. Many of these functions are provided in multiple forms with different argument types. Except where noted, any given form of a function returns the same data type as its argument(s); cross-type cases are resolved in the same way as explained above for operators. The functions working with double precision data are mostly implemented on top of the host system's C library; accuracy and behavior in boundary cases can therefore vary depending on the host system.
Table 9.5. Mathematical Functions
Table 9.6 shows functions for generating random numbers.
Table 9.6. Random Functions
The random() function uses a simple linear congruential algorithm. It is fast but not suitable for cryptographic applications; see the pgcrypto module for a more secure alternative.

If setseed() is called, the series of results of subsequent random() calls in the current session can be repeated by re-issuing setseed() with the same argument.

Without any prior setseed() call in the same session, the first random() call obtains a seed from a platform-dependent source of random bits.
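A minimal sketch of reproducing a sequence of random() results (the exact values returned are platform- and version-dependent):

SELECT setseed(0.5);
SELECT random(), random();   -- some pair of values
SELECT setseed(0.5);
SELECT random(), random();   -- the same pair again within this session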
Table 9.7 shows the available trigonometric functions. Each of these functions comes in two variants, one that measures angles in radians and one that measures angles in degrees.
Table 9.7. Trigonometric Functions
Another way to work with angles measured in degrees is to use the unit transformation functions degrees() and radians() shown earlier. However, using the degree-based trigonometric functions is preferred, as that way avoids round-off error for special cases such as sind(30).
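For example, sind(30) yields an exact result, whereas the radian-based computation can pick up a small round-off error:

SELECT sind(30);          -- exactly 0.5
SELECT sin(radians(30));  -- approximately 0.5, subject to round-off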
Table 9.8 shows the available hyperbolic functions.
Table 9.8. Hyperbolic Functions
This section describes functions and operators for examining and manipulating string values. Strings in this context include values of the types character, character varying, and text. Except where noted, these functions and operators are declared to accept and return type text. They will interchangeably accept character varying arguments. Values of type character will be converted to text before the function or operator is applied, resulting in stripping any trailing spaces in the character value.
SQL defines some string functions that use key words, rather than commas, to separate arguments. Details are in Table 9.9. PostgreSQL also provides versions of these functions that use the regular function invocation syntax (see Table 9.10).
The string concatenation operator (||) will accept non-string input, so long as at least one input is of string type, as shown in Table 9.9. For other cases, inserting an explicit coercion to text can be used to have non-string input accepted.
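A couple of illustrative queries:

SELECT 'Value: ' || 42;        -- works: one operand is of string type
SELECT 'Value: ' || 42::text;  -- explicit coercion, equivalent result
-- SELECT 1 || 2;              -- would fail: neither operand is of string type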
Table 9.9. SQL String Functions and Operators
Function/Operator Description Example(s) |
---|
Concatenates the two strings.
|
Converts the non-string input to text, then concatenates the two
strings. (The non-string input cannot be of an array type, because
that would create ambiguity with the array
|
Checks whether the string is in the specified Unicode normalization
form. The optional
|
Returns number of bits in the string (8
times the
|
Returns number of characters in the string.
|
Converts the string to all lower case, according to the rules of the database's locale.
|
Converts the string to the specified Unicode
normalization form. The optional
|
Returns number of bytes in the string.
|
Returns number of bytes in the string. Since this version of the
function accepts type
|
Replaces the substring of
|
Returns first starting index of the specified
|
Extracts the substring of
|
Extracts the first substring matching POSIX regular expression; see Section 9.7.3.
|
Extracts the first substring matching SQL regular expression; see Section 9.7.2. The first form has been specified since SQL:2003; the second form was only in SQL:1999 and should be considered obsolete.
|
Removes the longest string containing only characters in
|
This is a non-standard syntax for
|
Converts the string to all upper case, according to the rules of the database's locale.
|
Additional string manipulation functions are available and are listed in Table 9.10. Some of them are used internally to implement the SQL-standard string functions listed in Table 9.9.
Table 9.10. Other String Functions
Function Description Example(s) |
---|
Returns the numeric code of the first character of the argument. In UTF8 encoding, returns the Unicode code point of the character. In other multibyte encodings, the argument must be an ASCII character.
|
Removes the longest string containing only characters
in
|
Returns the character with the given code. In UTF8
encoding the argument is treated as a Unicode code point. In other
multibyte encodings the argument must designate
an ASCII character.
|
Concatenates the text representations of all the arguments. NULL arguments are ignored.
|
Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
|
Formats arguments according to a format string;
see Section 9.4.1.
This function is similar to the C function
|
Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
|
Returns first
|
Returns the number of characters in the string.
|
Extends the
|
Removes the longest string containing only characters in
|
Computes the MD5 hash of the argument, with the result written in hexadecimal.
|
Splits
|
Returns current client encoding name.
|
Returns the given string suitably quoted to be used as an identifier in an SQL statement string. Quotes are added only if necessary (i.e., if the string contains non-identifier characters or would be case-folded). Embedded quotes are properly doubled. See also Example 43.1.
|
Returns the given string suitably quoted to be used as a string literal
in an SQL statement string.
Embedded single-quotes and backslashes are properly doubled.
Note that
|
Converts the given value to text and then quotes it as a literal. Embedded single-quotes and backslashes are properly doubled.
|
Returns the given string suitably quoted to be used as a string literal
in an SQL statement string; or, if the argument
is null, returns
|
Converts the given value to text and then quotes it as a literal;
or, if the argument is null, returns
|
Returns captured substrings resulting from the first match of a POSIX
regular expression to the
|
Returns captured substrings resulting from the first match of a
POSIX regular expression to the
{bar} {baz}
|
Replaces substrings resulting from the first match of a
POSIX regular expression, or multiple substring matches
if the
|
Splits
|
Splits
hello world
|
Repeats
|
Replaces all occurrences in
|
Reverses the order of the characters in the string.
|
Returns last
|
Extends the
|
Removes the longest string containing only characters in
|
Splits
|
Returns first starting index of the specified
|
Extracts the substring of
|
Returns true if
|
Splits the
|
Splits the
xx NULL zz
|
Converts
|
Converts the number to its equivalent hexadecimal representation.
|
Replaces each character in
|
Evaluate escaped Unicode characters in the argument. Unicode characters
can be specified as
If the server encoding is not UTF-8, the Unicode code point identified by one of these escape sequences is converted to the actual server encoding; an error is reported if that's not possible. This function provides a (non-standard) alternative to string constants with Unicode escapes (see Section 4.1.2.3).
|
The concat, concat_ws and format functions are variadic, so it is possible to pass the values to be concatenated or formatted as an array marked with the VARIADIC keyword (see Section 38.5.6). The array's elements are treated as if they were separate ordinary arguments to the function. If the variadic array argument is NULL, concat and concat_ws return NULL, but format treats a NULL as a zero-element array.
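A few illustrative calls (not taken from the official examples):

SELECT concat_ws(',', VARIADIC ARRAY['a', 'b', 'c']);   -- a,b,c
SELECT concat(VARIADIC NULL::text[]);                   -- NULL
SELECT format('%s-%s', VARIADIC ARRAY['x', 'y']);       -- x-y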
See also the aggregate function string_agg in Section 9.21, and the functions for converting between strings and the bytea type in Table 9.13.
format

The function format produces output formatted according to a format string, in a style similar to the C function sprintf.

format ( formatstr text [, formatarg "any" [, ...] ] )

formatstr is a format string that specifies how the result should be formatted. Text in the format string is copied directly to the result, except where format specifiers are used. Format specifiers act as placeholders in the string, defining how subsequent function arguments should be formatted and inserted into the result. Each formatarg argument is converted to text according to the usual output rules for its data type, and then formatted and inserted into the result string according to the format specifier(s).
Format specifiers are introduced by a % character and have the form

%[position][flags][width]type

where the component fields are:
position (optional)

A string of the form n$ where n is the index of the argument to print. Index 1 means the first argument after formatstr. If the position is omitted, the default is to use the next argument in sequence.
flags (optional)

Additional options controlling how the format specifier's output is formatted. Currently the only supported flag is a minus sign (-) which will cause the format specifier's output to be left-justified. This has no effect unless the width field is also specified.
width (optional)

Specifies the minimum number of characters to use to display the format specifier's output. The output is padded on the left or right (depending on the - flag) with spaces as needed to fill the width. A too-small width does not cause truncation of the output, but is simply ignored. The width may be specified using any of the following: a positive integer; an asterisk (*) to use the next function argument as the width; or a string of the form *n$ to use the n'th function argument as the width. If the width comes from a function argument, that argument is consumed before the argument that is used for the format specifier's value. If the width argument is negative, the result is left aligned (as if the - flag had been specified) within a field of length abs(width).
type (required)

The type of format conversion to use to produce the format specifier's output. The following types are supported:

s formats the argument value as a simple string. A null value is treated as an empty string.

I treats the argument value as an SQL identifier, double-quoting it if necessary. It is an error for the value to be null (equivalent to quote_ident).

L quotes the argument value as an SQL literal. A null value is displayed as the string NULL, without quotes (equivalent to quote_nullable).

In addition to the format specifiers described above, the special sequence %% may be used to output a literal % character.
Here are some examples of the basic format conversions:
SELECT format('Hello %s', 'World'); Result:Hello World
SELECT format('Testing %s, %s, %s, %%', 'one', 'two', 'three'); Result:Testing one, two, three, %
SELECT format('INSERT INTO %I VALUES(%L)', 'Foo bar', E'O\'Reilly'); Result:INSERT INTO "Foo bar" VALUES('O''Reilly')
SELECT format('INSERT INTO %I VALUES(%L)', 'locations', 'C:\Program Files'); Result:INSERT INTO locations VALUES('C:\Program Files')
Here are examples using width fields and the - flag:
SELECT format('|%10s|', 'foo'); Result:| foo|
SELECT format('|%-10s|', 'foo'); Result:|foo |
SELECT format('|%*s|', 10, 'foo'); Result:| foo|
SELECT format('|%*s|', -10, 'foo'); Result:|foo |
SELECT format('|%-*s|', 10, 'foo'); Result:|foo |
SELECT format('|%-*s|', -10, 'foo'); Result:|foo |
These examples show use of position fields:
SELECT format('Testing %3$s, %2$s, %1$s', 'one', 'two', 'three'); Result:Testing three, two, one
SELECT format('|%*2$s|', 'foo', 10, 'bar'); Result:| bar|
SELECT format('|%1$*2$s|', 'foo', 10, 'bar'); Result:| foo|
Unlike the standard C function sprintf, PostgreSQL's format function allows format specifiers with and without position fields to be mixed in the same format string. A format specifier without a position field always uses the next argument after the last argument consumed. In addition, the format function does not require all function arguments to be used in the format string. For example:

SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
Result: Testing three, two, three

The %I and %L format specifiers are particularly useful for safely constructing dynamic SQL statements. See Example 43.1.
This section describes functions and operators for examining and manipulating binary strings, that is values of type bytea. Many of these are equivalent, in purpose and syntax, to the text-string functions described in the previous section.
SQL defines some string functions that use key words, rather than commas, to separate arguments. Details are in Table 9.11. PostgreSQL also provides versions of these functions that use the regular function invocation syntax (see Table 9.12).
Table 9.11. SQL Binary String Functions and Operators
Additional binary string manipulation functions are available and are listed in Table 9.12. Some of them are used internally to implement the SQL-standard string functions listed in Table 9.11.
Table 9.12. Other Binary String Functions
Function Description Example(s) |
---|
Returns the number of bits set in the binary string (also known as “popcount”).
|
Removes the longest string containing only bytes appearing in
|
Extracts n'th bit from binary string.
|
Extracts n'th byte from binary string.
|
Returns the number of bytes in the binary string.
|
Returns the number of characters in the binary string, assuming
that it is text in the given
|
Removes the longest string containing only bytes appearing in
|
Computes the MD5 hash of the binary string, with the result written in hexadecimal.
|
Removes the longest string containing only bytes appearing in
|
Sets n'th bit in
binary string to
|
Sets n'th byte in
binary string to
|
Computes the SHA-224 hash of the binary string.
|
Computes the SHA-256 hash of the binary string.
|
Computes the SHA-384 hash of the binary string.
|
Computes the SHA-512 hash of the binary string.
|
Extracts the substring of
|
Functions get_byte and set_byte number the first byte of a binary string as byte 0. Functions get_bit and set_bit number bits from the right within each byte; for example bit 0 is the least significant bit of the first byte, and bit 15 is the most significant bit of the second byte.

For historical reasons, the function md5 returns a hex-encoded value of type text whereas the SHA-2 functions return type bytea. Use the functions encode and decode to convert between the two. For example write encode(sha256('abc'), 'hex') to get a hex-encoded text representation, or decode(md5('abc'), 'hex') to get a bytea value.
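For instance, a hex-encoded SHA-256 digest can be obtained as described above (illustrative calls):

SELECT encode(sha256('abc'), 'hex');   -- 64 hex digits, returned as text
SELECT md5('abc');                     -- already returned as hex text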
Functions for converting strings between different character sets (encodings), and for representing arbitrary binary data in textual form, are shown in Table 9.13. For these functions, an argument or result of type text is expressed in the database's default encoding, while arguments or results of type bytea are in an encoding named by another argument.
Table 9.13. Text/Binary String Conversion Functions
Function Description Example(s) |
---|
Converts a binary string representing text in
encoding
|
Converts a binary string representing text in
encoding
|
Converts a
|
Encodes binary data into a textual representation; supported
|
Decodes binary data from a textual representation; supported
|
The encode and decode functions support the following textual formats:

The base64 format is that of RFC 2045 Section 6.8. As per the RFC, encoded lines are broken at 76 characters. However, instead of the MIME CRLF end-of-line marker, only a newline is used for end-of-line. The decode function ignores carriage-return, newline, space, and tab characters. Otherwise, an error is raised when decode is supplied invalid base64 data, including when trailing padding is incorrect.
The escape format converts zero bytes and bytes with the high bit set into octal escape sequences (\nnn), and it doubles backslashes. Other byte values are represented literally. The decode function will raise an error if a backslash is not followed by either a second backslash or three octal digits; it accepts other byte values unchanged.
The hex format represents each 4 bits of data as one hexadecimal digit, 0 through f, writing the higher-order digit of each byte first. The encode function outputs the a-f hex digits in lower case. Because the smallest unit of data is 8 bits, there are always an even number of characters returned by encode. The decode function accepts the a-f characters in either upper or lower case. An error is raised when decode is given invalid hex data, including when given an odd number of characters.
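A short illustration of the round trip between bytea and the textual formats:

SELECT encode('abc'::bytea, 'base64');    -- YWJj
SELECT decode('YWJj', 'base64');          -- \x616263
SELECT encode('\x616263'::bytea, 'hex');  -- 616263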
See also the aggregate function string_agg in Section 9.21 and the large object functions in Section 35.4.

This section describes functions and operators for examining and manipulating bit strings, that is values of the types bit and bit varying. (While only type bit is mentioned in these tables, values of type bit varying can be used interchangeably.) Bit strings support the usual comparison operators shown in Table 9.1, as well as the operators shown in Table 9.14.
Table 9.14. Bit String Operators
Operator Description Example(s) |
---|
Concatenation
|
Bitwise AND (inputs must be of equal length)
|
Bitwise OR (inputs must be of equal length)
|
Bitwise exclusive OR (inputs must be of equal length)
|
Bitwise NOT
|
Bitwise shift left (string length is preserved)
|
Bitwise shift right (string length is preserved)
|
Some of the functions available for binary strings are also available for bit strings, as shown in Table 9.15.
Table 9.15. Bit String Functions
In addition, it is possible to cast integral values to and from type bit. Casting an integer to bit(n) copies the rightmost n bits. Casting an integer to a bit string width wider than the integer itself will sign-extend on the left. Some examples:

44::bit(10)                0000101100
44::bit(3)                 100
cast(-44 as bit(12))       111111010100
'1110'::bit(4)::integer    14

Note that casting to just “bit” means casting to bit(1), and so will deliver only the least significant bit of the integer.
There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE operator, the more recent SIMILAR TO operator (added in SQL:1999), and POSIX-style regular expressions. Aside from the basic “does this string match this pattern?” operators, functions are available to extract or replace matching substrings and to split a string at matching locations.
If you have pattern matching needs that go beyond this, consider writing a user-defined function in Perl or Tcl.
While most regular-expression searches can be executed very quickly, regular expressions can be contrived that take arbitrary amounts of time and memory to process. Be wary of accepting regular-expression search patterns from hostile sources. If you must do so, it is advisable to impose a statement timeout.
Searches using SIMILAR TO patterns have the same security hazards, since SIMILAR TO provides many of the same capabilities as POSIX-style regular expressions. LIKE searches, being much simpler than the other two options, are safer to use with possibly-hostile pattern sources.
The pattern matching operators of all three kinds do not support nondeterministic collations. If required, apply a different collation to the expression to work around this limitation.
LIKE

string LIKE pattern [ESCAPE escape-character]
string NOT LIKE pattern [ESCAPE escape-character]

The LIKE expression returns true if the string matches the supplied pattern. (As expected, the NOT LIKE expression returns false if LIKE returns true, and vice versa. An equivalent expression is NOT (string LIKE pattern).)
If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE acts like the equals operator. An underscore (_) in pattern stands for (matches) any single character; a percent sign (%) matches any sequence of zero or more characters.

Some examples:

'abc' LIKE 'abc'    true
'abc' LIKE 'a%'     true
'abc' LIKE '_b_'    true
'abc' LIKE 'c'      false
LIKE pattern matching always covers the entire string. Therefore, if it's desired to match a sequence anywhere within a string, the pattern must start and end with a percent sign.

To match a literal underscore or percent sign without matching other characters, the respective character in pattern must be preceded by the escape character. The default escape character is the backslash but a different one can be selected by using the ESCAPE clause. To match the escape character itself, write two escape characters.
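For instance (assuming standard_conforming_strings is on, its default setting):

SELECT 'a_c' LIKE 'a\_c';              -- true: the underscore is matched literally
SELECT 'abc' LIKE 'a\_c';              -- false
SELECT 'a_c' LIKE 'a#_c' ESCAPE '#';   -- true, using an alternative escape character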
If you have standard_conforming_strings turned off, any backslashes you write in literal string constants will need to be doubled. See Section 4.1.2.1 for more information.
It's also possible to select no escape character by writing ESCAPE ''. This effectively disables the escape mechanism, which makes it impossible to turn off the special meaning of underscore and percent signs in the pattern. According to the SQL standard, omitting ESCAPE means there is no escape character (rather than defaulting to a backslash), and a zero-length ESCAPE value is disallowed. PostgreSQL's behavior in this regard is therefore slightly nonstandard.
The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.

The operator ~~ is equivalent to LIKE, and ~~* corresponds to ILIKE. There are also !~~ and !~~* operators that represent NOT LIKE and NOT ILIKE, respectively. All of these operators are PostgreSQL-specific. You may see these operator names in EXPLAIN output and similar places, since the parser actually translates LIKE et al. to these operators.
The phrases LIKE, ILIKE, NOT LIKE, and NOT ILIKE are generally treated as operators in PostgreSQL syntax; for example they can be used in expression operator ANY (subquery) constructs, although an ESCAPE clause cannot be included there. In some obscure cases it may be necessary to use the underlying operator names instead.

Also see the prefix operator ^@ and corresponding starts_with function, which are useful in cases where simply matching the beginning of a string is needed.
SIMILAR TO Regular Expressions

string SIMILAR TO pattern [ESCAPE escape-character]
string NOT SIMILAR TO pattern [ESCAPE escape-character]

The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common (POSIX) regular expression notation.
Like LIKE, the SIMILAR TO operator succeeds only if its pattern matches the entire string; this is unlike common regular expression behavior where the pattern can match any part of the string. Also like LIKE, SIMILAR TO uses _ and % as wildcard characters denoting any single character and any string, respectively (these are comparable to . and .* in POSIX regular expressions).

In addition to these facilities borrowed from LIKE, SIMILAR TO supports these pattern-matching metacharacters borrowed from POSIX regular expressions:
| denotes alternation (either of two alternatives).
* denotes repetition of the previous item zero or more times.
+ denotes repetition of the previous item one or more times.
? denotes repetition of the previous item zero or one time.
{m} denotes repetition of the previous item exactly m times.
{m,} denotes repetition of the previous item m or more times.
{m,n} denotes repetition of the previous item at least m and not more than n times.
Parentheses () can be used to group items into a single logical item.
A bracket expression [...] specifies a character class, just as in POSIX regular expressions.
Notice that the period (.) is not a metacharacter for SIMILAR TO.

As with LIKE, a backslash disables the special meaning of any of these metacharacters. A different escape character can be specified with ESCAPE, or the escape capability can be disabled by writing ESCAPE ''.

According to the SQL standard, omitting ESCAPE means there is no escape character (rather than defaulting to a backslash), and a zero-length ESCAPE value is disallowed. PostgreSQL's behavior in this regard is therefore slightly nonstandard.
Another nonstandard extension is that following the escape character with a letter or digit provides access to the escape sequences defined for POSIX regular expressions; see Table 9.20, Table 9.21, and Table 9.22 below.
Some examples:
'abc' SIMILAR TO 'abc'          true
'abc' SIMILAR TO 'a'            false
'abc' SIMILAR TO '%(b|d)%'      true
'abc' SIMILAR TO '(b|c)%'       false
'-abc-' SIMILAR TO '%\mabc\M%'  true
'xabcy' SIMILAR TO '%\mabc\M%'  false
The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern. The function can be written according to standard SQL syntax:

substring(string similar pattern escape escape-character)

or using the now obsolete SQL:1999 syntax:

substring(string from pattern for escape-character)

or as a plain three-argument function:

substring(string, pattern, escape-character)
As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null. To indicate the part of the pattern for which the matching data sub-string is of interest, the pattern should contain two occurrences of the escape character followed by a double quote ("). The text matching the portion of the pattern between these separators is returned when the match is successful.

The escape-double-quote separators actually divide substring's pattern into three independent regular expressions; for example, a vertical bar (|) in any of the three sections affects only that section. Also, the first and third of these regular expressions are defined to match the smallest possible amount of text, not the largest, when there is any ambiguity about how much of the data string matches which pattern. (In POSIX parlance, the first and third regular expressions are forced to be non-greedy.)
As an extension to the SQL standard, PostgreSQL allows there to be just one escape-double-quote separator, in which case the third regular expression is taken as empty; or no separators, in which case the first and third regular expressions are taken as empty.
Some examples, with #" delimiting the return string:

substring('foobar' similar '%#"o_b#"%' escape '#')    oob
substring('foobar' similar '#"o_b#"%' escape '#')     NULL
Table 9.16 lists the available operators for pattern matching using POSIX regular expressions.
Table 9.16. Regular Expression Match Operators
Operator | Description |
---|---|
text ~ text → boolean | String matches regular expression, case sensitively |
text ~* text → boolean | String matches regular expression, case insensitively |
text !~ text → boolean | String does not match regular expression, case sensitively |
text !~* text → boolean | String does not match regular expression, case insensitively |
POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and SIMILAR TO operators. Many Unix tools such as egrep, sed, or awk use a pattern matching language that is similar to the one described here.

A regular expression is a character sequence that is an abbreviated definition of a set of strings (a regular set). A string is said to match a regular expression if it is a member of the regular set described by the regular expression. As with LIKE, pattern characters match string characters exactly unless they are special characters in the regular expression language, but regular expressions use different special characters than LIKE does. Unlike LIKE patterns, a regular expression is allowed to match anywhere within a string, unless the regular expression is explicitly anchored to the beginning or end of the string.
Some examples:
'abcd' ~ 'bc'       true
'abcd' ~ 'a.c'      true — dot matches any character
'abcd' ~ 'a.*d'     true — * repeats the preceding pattern item
'abcd' ~ '(b|x)'    true — | means OR, parentheses group
'abcd' ~ '^a'       true — ^ anchors to start of string
'abcd' ~ '^(b|c)'   false — would match except for anchoring
The POSIX pattern language is described in much greater detail below.
The substring function with two parameters, substring(string from pattern), provides extraction of a substring that matches a POSIX regular expression pattern. It returns null if there is no match, otherwise the first portion of the text that matched the pattern. But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned. You can put parentheses around the whole expression if you want to use parentheses within it without triggering this exception. If you need parentheses in the pattern before the subexpression you want to extract, see the non-capturing parentheses described below.

Some examples:

substring('foobar' from 'o.b')     oob
substring('foobar' from 'o(.)b')   o
The regexp_replace function provides substitution of new text for substrings that match POSIX regular expression patterns. It has the syntax regexp_replace(source, pattern, replacement [, flags ]). The source string is returned unchanged if there is no match to the pattern. If there is a match, the source string is returned with the replacement string substituted for the matching substring. The replacement string can contain \n, where n is 1 through 9, to indicate that the source substring matching the n'th parenthesized subexpression of the pattern should be inserted, and it can contain \& to indicate that the substring matching the entire pattern should be inserted. Write \\ if you need to put a literal backslash in the replacement text.
The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag i specifies case-insensitive matching, while flag g specifies replacement of each matching substring rather than only the first one. Supported flags (though not g) are described in Table 9.24.

Some examples:

regexp_replace('foobarbaz', 'b..', 'X')              fooXbaz
regexp_replace('foobarbaz', 'b..', 'X', 'g')         fooXX
regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')    fooXarYXazY
The regexp_match function returns a text array of captured substring(s) resulting from the first match of a POSIX regular expression pattern to a string. It has the syntax regexp_match(string, pattern [, flags ]). If there is no match, the result is NULL. If a match is found, and the pattern contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the pattern contains parenthesized subexpressions, then the result is a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern (not counting “non-capturing” parentheses; see below for details).

The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Supported flags are described in Table 9.24.
Some examples:
SELECT regexp_match('foobarbequebaz', 'bar.*que'); regexp_match -------------- {barbeque} (1 row) SELECT regexp_match('foobarbequebaz', '(bar)(beque)'); regexp_match -------------- {bar,beque} (1 row)
In the common case where you just want the whole matching substring
or NULL
for no match, write something like
SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1]; regexp_match -------------- barbeque (1 row)
The regexp_matches function returns a set of text arrays of captured substring(s) resulting from matching a POSIX regular expression pattern to a string. It has the same syntax as regexp_match. This function returns no rows if there is no match, one row if there is a match and the g flag is not given, or N rows if there are N matches and the g flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern, just as described above for regexp_match. regexp_matches accepts all the flags shown in Table 9.24, plus the g flag which commands it to return all matches, not just the first one.
Some examples:
SELECT regexp_matches('foo', 'not there'); regexp_matches ---------------- (0 rows) SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g'); regexp_matches ---------------- {bar,beque} {bazil,barf} (2 rows)
In most cases regexp_matches() should be used with the g flag, since if you only want the first match, it's easier and more efficient to use regexp_match(). However, regexp_match() only exists in PostgreSQL version 10 and up. When working in older versions, a common trick is to place a regexp_matches() call in a sub-select, for example:

SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;

This produces a text array if there's a match, or NULL if not, the same as regexp_match() would do. Without the sub-select, this query would produce no output at all for table rows without a match, which is typically not the desired behavior.
The regexp_split_to_table function splits a string using a POSIX regular expression pattern as a delimiter. It has the syntax regexp_split_to_table(string, pattern [, flags ]). If there is no match to the pattern, the function returns the string. If there is at least one match, for each match it returns the text from the end of the last match (or the beginning of the string) to the beginning of the match. When there are no more matches, it returns the text from the end of the last match to the end of the string. The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. regexp_split_to_table supports the flags described in Table 9.24.

The regexp_split_to_array function behaves the same as regexp_split_to_table, except that regexp_split_to_array returns its result as an array of text. It has the syntax regexp_split_to_array(string, pattern [, flags ]). The parameters are the same as for regexp_split_to_table.
Some examples:
SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '\s+') AS foo; foo ------- the quick brown fox jumps over the lazy dog (9 rows) SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+'); regexp_split_to_array ----------------------------------------------- {the,quick,brown,fox,jumps,over,the,lazy,dog} (1 row) SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo; foo ----- t h e q u i c k b r o w n f o x (16 rows)
As the last example demonstrates, the regexp split functions ignore zero-length matches that occur at the start or end of the string or immediately after a previous match. This is contrary to the strict definition of regexp matching that is implemented by regexp_match and regexp_matches, but is usually the most convenient behavior in practice. Other software systems such as Perl use similar definitions.
PostgreSQL's regular expressions are implemented using a software package written by Henry Spencer. Much of the description of regular expressions below is copied verbatim from his manual.
Regular expressions (REs), as defined in POSIX 1003.2, come in two forms: extended REs or EREs (roughly those of egrep), and basic REs or BREs (roughly those of ed). PostgreSQL supports both forms, and also implements some extensions that are not in the POSIX standard, but have become widely used due to their availability in programming languages such as Perl and Tcl. REs using these non-POSIX extensions are called advanced REs or AREs in this documentation. AREs are almost an exact superset of EREs, but BREs have several notational incompatibilities (as well as being much more limited). We first describe the ARE and ERE forms, noting features that apply only to AREs, and then describe how BREs differ.
PostgreSQL always initially presumes that a regular expression follows the ARE rules. However, the more limited ERE or BRE rules can be chosen by prepending an embedded option to the RE pattern, as described in Section 9.7.3.4. This can be useful for compatibility with applications that expect exactly the POSIX 1003.2 rules.
A regular expression is defined as one or more branches, separated by |. It matches anything that matches one of the branches.
A branch is zero or more quantified atoms or constraints, concatenated. It matches a match for the first, followed by a match for the second, etc; an empty branch matches the empty string.
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a match for the atom. With a quantifier, it can match some number of matches of the atom. An atom can be any of the possibilities shown in Table 9.17. The possible quantifiers and their meanings are shown in Table 9.18.
A constraint matches an empty string, but matches only when specific conditions are met. A constraint can be used where an atom could be used, except it cannot be followed by a quantifier. The simple constraints are shown in Table 9.19; some more constraints are described later.
Table 9.17. Regular Expression Atoms
Atom | Description |
---|---|
( re ) | (where re is any regular expression)
matches a match for
re , with the match noted for possible reporting |
(?: re ) | as above, but the match is not noted for reporting (a “non-capturing” set of parentheses) (AREs only) |
. | matches any single character |
[ chars ] | a bracket expression,
matching any one of the chars (see
Section 9.7.3.2 for more detail) |
\ k | (where k is a non-alphanumeric character)
matches that character taken as an ordinary character,
e.g., \\ matches a backslash character |
\ c | where c is alphanumeric
(possibly followed by other characters)
is an escape, see Section 9.7.3.3
(AREs only; in EREs and BREs, this matches c ) |
{ | when followed by a character other than a digit,
matches the left-brace character { ;
when followed by a digit, it is the beginning of a
bound (see below) |
x | where x is a single character with no other
significance, matches that character |
An RE cannot end with a backslash (\
).
If you have standard_conforming_strings turned off, any backslashes you write in literal string constants will need to be doubled. See Section 4.1.2.1 for more information.
Table 9.18. Regular Expression Quantifiers
Quantifier | Matches |
---|---|
* | a sequence of 0 or more matches of the atom |
+ | a sequence of 1 or more matches of the atom |
? | a sequence of 0 or 1 matches of the atom |
{ m } | a sequence of exactly m matches of the atom |
{ m ,} | a sequence of m or more matches of the atom |
{ m , n } | a sequence of m through n (inclusive) matches of the atom; m cannot exceed n |
*? | non-greedy version of * |
+? | non-greedy version of + |
?? | non-greedy version of ? |
{ m }? | non-greedy version of { m } |
{ m ,}? | non-greedy version of { m ,} |
{ m , n }? | non-greedy version of { m , n } |
The forms using { ... } are known as bounds. The numbers m and n within a bound are unsigned decimal integers with permissible values from 0 to 255 inclusive.
Non-greedy quantifiers (available in AREs only) match the same possibilities as their corresponding normal (greedy) counterparts, but prefer the smallest number rather than the largest number of matches. See Section 9.7.3.5 for more detail.
A quantifier cannot immediately follow another quantifier, e.g.,
**
is invalid.
A quantifier cannot
begin an expression or subexpression or follow
^
or |
.
Table 9.19. Regular Expression Constraints
Constraint | Description |
---|---|
^ | matches at the beginning of the string |
$ | matches at the end of the string |
(?= re ) | positive lookahead matches at any point where a substring matching re begins (AREs only) |
(?! re ) | negative lookahead matches at any point where no substring matching re begins (AREs only) |
(?<= re ) | positive lookbehind matches at any point where a substring matching re ends (AREs only) |
(?<! re ) | negative lookbehind matches at any point where no substring matching re ends (AREs only) |
Lookahead and lookbehind constraints cannot contain back references (see Section 9.7.3.3), and all parentheses within them are considered non-capturing.
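For example, a lookahead or lookbehind constraint restricts where a match may occur without consuming the surrounding text; with default settings, queries along these lines should behave as shown:
SELECT regexp_match('foobar', 'foo(?=bar)');
Result: {foo}
SELECT regexp_match('foobar', '(?<=foo)bar');
Result: {bar}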
A bracket expression is a list of
characters enclosed in []
. It normally matches
any single character from the list (but see below). If the list
begins with ^
, it matches any single character
not from the rest of the list.
If two characters
in the list are separated by -
, this is
shorthand for the full range of characters between those two
(inclusive) in the collating sequence,
e.g., [0-9]
in ASCII matches
any decimal digit. It is illegal for two ranges to share an
endpoint, e.g., a-c-e
. Ranges are very
collating-sequence-dependent, so portable programs should avoid
relying on them.
To include a literal ]
in the list, make it the
first character (after ^
, if that is used). To
include a literal -
, make it the first or last
character, or the second endpoint of a range. To use a literal
-
as the first endpoint of a range, enclose it
in [.
and .]
to make it a
collating element (see below). With the exception of these characters,
some combinations using [
(see next paragraphs), and escapes (AREs only), all other special
characters lose their special significance within a bracket expression.
In particular, \
is not special when following
ERE or BRE rules, though it is special (as introducing an escape)
in AREs.
Within a bracket expression, a collating element (a character, a
multiple-character sequence that collates as if it were a single
character, or a collating-sequence name for either) enclosed in
[.
and .]
stands for the
sequence of characters of that collating element. The sequence is
treated as a single element of the bracket expression's list. This
allows a bracket
expression containing a multiple-character collating element to
match more than one character, e.g., if the collating sequence
includes a ch
collating element, then the RE
[[.ch.]]*c
matches the first five characters of
chchcc
.
PostgreSQL currently does not support multi-character collating elements. This information describes possible future behavior.
Within a bracket expression, a collating element enclosed in
[=
and =]
is an equivalence
class, standing for the sequences of characters of all collating
elements equivalent to that one, including itself. (If there are
no other equivalent collating elements, the treatment is as if the
enclosing delimiters were [.
and
.]
.) For example, if o
and
^
are the members of an equivalence class, then
[[=o=]]
, [[=^=]]
, and
[o^]
are all synonymous. An equivalence class
cannot be an endpoint of a range.
Within a bracket expression, the name of a character class
enclosed in [:
and :]
stands
for the list of all characters belonging to that class. A character
class cannot be used as an endpoint of a range.
The POSIX standard defines these character class
names:
alnum
(letters and numeric digits),
alpha
(letters),
blank
(space and tab),
cntrl
(control characters),
digit
(numeric digits),
graph
(printable characters except space),
lower
(lower-case letters),
print
(printable characters including space),
punct
(punctuation),
space
(any white space),
upper
(upper-case letters),
and xdigit
(hexadecimal digits).
The behavior of these standard character classes is generally
consistent across platforms for characters in the 7-bit ASCII set.
Whether a given non-ASCII character is considered to belong to one
of these classes depends on the collation
that is used for the regular-expression function or operator
(see Section 24.2), or by default on the
database's LC_CTYPE
locale setting (see
Section 24.1). The classification of non-ASCII
characters can vary across platforms even in similarly-named
locales. (But the C
locale never considers any
non-ASCII characters to belong to any of these classes.)
In addition to these standard character
classes, PostgreSQL defines
the word
character class, which is the same as
alnum
plus the underscore (_
)
character, and
the ascii
character class, which contains exactly
the 7-bit ASCII set.
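For example, the word class accepts letters, digits, and underscores, so one would expect:
SELECT 'foo_bar1' ~ '^[[:word:]]+$';
Result: t
SELECT 'foo bar' ~ '^[[:word:]]+$';
Result: f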
There are two special cases of bracket expressions: the bracket
expressions [[:<:]]
and
[[:>:]]
are constraints,
matching empty strings at the beginning
and end of a word respectively. A word is defined as a sequence
of word characters that is neither preceded nor followed by word
characters. A word character is any character belonging to the
word
character class, that is, any letter, digit,
or underscore. This is an extension, compatible with but not
specified by POSIX 1003.2, and should be used with
caution in software intended to be portable to other systems.
The constraint escapes described below are usually preferable; they
are no more standard, but are easier to type.
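For example, these illustrative queries should distinguish the whole word cat from occurrences inside longer words (the equivalent constraint escapes, \mcat\M, would behave the same way):
SELECT 'a cat sat' ~ '[[:<:]]cat[[:>:]]';
Result: t
SELECT 'concatenate' ~ '[[:<:]]cat[[:>:]]';
Result: f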
Escapes are special sequences beginning with \
followed by an alphanumeric character. Escapes come in several varieties:
character entry, class shorthands, constraint escapes, and back references.
A \
followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs.
In EREs, there are no escapes: outside a bracket expression,
a \
followed by an alphanumeric character merely stands for
that character as an ordinary character, and inside a bracket expression,
\
is an ordinary character.
(The latter is the one actual incompatibility between EREs and AREs.)
Character-entry escapes exist to make it easier to specify non-printing and other inconvenient characters in REs. They are shown in Table 9.20.
Class-shorthand escapes provide shorthands for certain commonly-used character classes. They are shown in Table 9.21.
A constraint escape is a constraint, matching the empty string if specific conditions are met, written as an escape. They are shown in Table 9.22.
A back reference (\
n
) matches the
same string matched by the previous parenthesized subexpression specified
by the number n
(see Table 9.23). For example,
([bc])\1
matches bb
or cc
but not bc
or cb
.
The subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.
The back reference considers only the string characters matched by the
referenced subexpression, not any constraints contained in it. For
example, (^\d)\1
will match 22
.
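Expressed as queries (assuming standard_conforming_strings is on, so the backslashes need not be doubled), the example above behaves like this:
SELECT 'bb' ~ '([bc])\1';
Result: t
SELECT 'bc' ~ '([bc])\1';
Result: f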
Table 9.20. Regular Expression Character-Entry Escapes
Escape | Description |
---|---|
\a | alert (bell) character, as in C |
\b | backspace, as in C |
\B | synonym for backslash (\ ) to help reduce the need for backslash doubling |
\c X | (where X is any character) the character whose low-order 5 bits are the same as those of X , and whose other bits are all zero |
\e | the character whose collating-sequence name is ESC , or failing that, the character with octal value 033 |
\f | form feed, as in C |
\n | newline, as in C |
\r | carriage return, as in C |
\t | horizontal tab, as in C |
\u wxyz | (where wxyz is exactly four hexadecimal digits) the character whose hexadecimal value is 0x wxyz |
\U stuvwxyz | (where stuvwxyz is exactly eight hexadecimal digits) the character whose hexadecimal value is 0x stuvwxyz |
\v | vertical tab, as in C |
\x hhh | (where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0x hhh (a single character no matter how many hexadecimal digits are used) |
\0 | the character whose value is 0 (the null byte) |
\ xy | (where xy is exactly two octal digits, and is not a back reference) the character whose octal value is 0 xy |
\ xyz | (where xyz is exactly three octal digits, and is not a back reference) the character whose octal value is 0 xyz |
Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
Numeric character-entry escapes specifying values outside the ASCII range
(0–127) have meanings dependent on the database encoding. When the
encoding is UTF-8, escape values are equivalent to Unicode code points,
for example \u1234
means the character U+1234
.
For other multibyte encodings, character-entry escapes usually just
specify the concatenation of the byte values for the character. If the
escape value does not correspond to any legal character in the database
encoding, no error will be raised, but it will never match any data.
The character-entry escapes are always taken as ordinary characters.
For example, \135
is ]
in ASCII, but
\135
does not terminate a bracket expression.
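For example, with default string-literal settings one would expect the \t and \x escapes to match the corresponding characters:
SELECT ('a' || chr(9) || 'b') ~ 'a\tb';   -- chr(9) is a tab character
Result: t
SELECT 'A' ~ '\x41';                      -- 0x41 is the ASCII code of A
Result: t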
Table 9.21. Regular Expression Class-Shorthand Escapes
Escape | Description |
---|---|
\d | matches any digit, like [[:digit:]] |
\s | matches any whitespace character, like [[:space:]] |
\w | matches any word character, like [[:word:]] |
\D | matches any non-digit, like [^[:digit:]] |
\S | matches any non-whitespace character, like [^[:space:]] |
\W | matches any non-word character, like [^[:word:]] |
The class-shorthand escapes also work within bracket expressions,
although the definitions shown above are not quite syntactically
valid in that context.
For example, [a-c\d]
is equivalent to
[a-c[:digit:]]
.
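For example, combining class shorthands with quantifiers and capturing parentheses:
SELECT regexp_match('abc 123', '(\w+)\s+(\d+)');
Result: {abc,123}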
Table 9.22. Regular Expression Constraint Escapes
Escape | Description |
---|---|
\A | matches only at the beginning of the string (see Section 9.7.3.5 for how this differs from ^ ) |
\m | matches only at the beginning of a word |
\M | matches only at the end of a word |
\y | matches only at the beginning or end of a word |
\Y | matches only at a point that is not the beginning or end of a word |
\Z | matches only at the end of the string (see Section 9.7.3.5 for how this differs from $ ) |
A word is defined as in the specification of
[[:<:]]
and [[:>:]]
above.
Constraint escapes are illegal within bracket expressions.
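For example, the \m constraint can anchor a match to the start of each word; with default settings one would expect:
SELECT regexp_replace('one two three', '\mt', 'T', 'g');
Result: one Two Three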
Table 9.23. Regular Expression Back References
Escape | Description |
---|---|
\ m | (where m is a nonzero digit) a back reference to the m 'th subexpression |
\ mnn | (where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference to the mnn 'th subexpression |
There is an inherent ambiguity between octal character-entry escapes and back references, which is resolved by the following heuristics, as hinted at above. A leading zero always indicates an octal escape. A single non-zero digit, not followed by another digit, is always taken as a back reference. A multi-digit sequence not starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e., the number is in the legal range for a back reference), and otherwise is taken as octal.
In addition to the main syntax described above, there are some special forms and miscellaneous syntactic facilities available.
An RE can begin with one of two special director prefixes.
If an RE begins with ***:
,
the rest of the RE is taken as an ARE. (This normally has no effect in
PostgreSQL, since REs are assumed to be AREs;
but it does have an effect if ERE or BRE mode had been specified by
the flags
parameter to a regex function.)
If an RE begins with ***=
,
the rest of the RE is taken to be a literal string,
with all characters considered ordinary characters.
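For example, with the ***= director the pattern is taken literally, so a dot matches only a dot:
SELECT 'a.c' ~ '***=a.c';
Result: t
SELECT 'abc' ~ '***=a.c';
Result: f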
An ARE can begin with embedded options:
a sequence (?
xyz
)
(where xyz
is one or more alphabetic characters)
specifies options affecting the rest of the RE.
These options override any previously determined options —
in particular, they can override the case-sensitivity behavior implied by
a regex operator, or the flags
parameter to a regex
function.
The available option letters are
shown in Table 9.24.
Note that these same option letters are used in the flags
parameters of regex functions.
Table 9.24. ARE Embedded-Option Letters
Option | Description |
---|---|
b | rest of RE is a BRE |
c | case-sensitive matching (overrides operator type) |
e | rest of RE is an ERE |
i | case-insensitive matching (see Section 9.7.3.5) (overrides operator type) |
m | historical synonym for n |
n | newline-sensitive matching (see Section 9.7.3.5) |
p | partial newline-sensitive matching (see Section 9.7.3.5) |
q | rest of RE is a literal (“quoted”) string, all ordinary characters |
s | non-newline-sensitive matching (default) |
t | tight syntax (default; see below) |
w | inverse partial newline-sensitive (“weird”) matching (see Section 9.7.3.5) |
x | expanded syntax (see below) |
Embedded options take effect at the )
terminating the sequence.
They can appear only at the start of an ARE (after the
***:
director if any).
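For example, the i option makes an otherwise case-sensitive ~ comparison case-insensitive:
SELECT 'FOO' ~ 'foo';
Result: f
SELECT 'FOO' ~ '(?i)foo';
Result: t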
In addition to the usual (tight) RE syntax, in which all
characters are significant, there is an expanded syntax,
available by specifying the embedded x
option.
In the expanded syntax,
white-space characters in the RE are ignored, as are
all characters between a #
and the following newline (or the end of the RE). This
permits paragraphing and commenting a complex RE.
There are three exceptions to that basic rule: a white-space character or # preceded by \ is retained; white space or # within a bracket expression is retained; and white space and comments cannot appear within multi-character symbols, such as (?: .
For this purpose, white-space characters are blank, tab, newline, and
any character that belongs to the space
character class.
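For example, passing the x flag to a regex function lets a pattern be written with explanatory white space and a trailing comment; an illustrative query such as this should still capture all three fields:
SELECT regexp_match('2023-08-15', '(\d{4}) - (\d{2}) - (\d{2})   # an ISO-style date', 'x');
Result: {2023,08,15}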
Finally, in an ARE, outside bracket expressions, the sequence
(?#
ttt
)
(where ttt
is any text not containing a )
)
is a comment, completely ignored.
Again, this is not allowed between the characters of
multi-character symbols, like (?:
.
Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.
None of these metasyntax extensions is available if
an initial ***=
director
has specified that the user's input be treated as a literal string
rather than as an RE.
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.
Whether an RE is greedy or not is determined by the following rules:
Most atoms, and all constraints, have no greediness attribute (because they cannot match variable amounts of text anyway).
Adding parentheses around an RE does not change its greediness.
A quantified atom with a fixed-repetition quantifier ({ m } or { m }? ) has the same greediness (possibly none) as the atom itself.
A quantified atom with other normal quantifiers (including { m , n } with m equal to n ) is greedy (prefers longest match).
A quantified atom with a non-greedy quantifier (including { m , n }? with m equal to n ) is non-greedy (prefers shortest match).
A branch — that is, an RE that has no top-level | operator — has the same greediness as the first quantified atom in it that has a greediness attribute.
An RE consisting of two or more branches connected by the | operator is always greedy.
The above rules associate greediness attributes not only with individual quantified atoms, but with branches and entire REs that contain quantified atoms. What that means is that the matching is done in such a way that the branch, or whole RE, matches the longest or shortest possible substring as a whole. Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})'); Result:123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})'); Result:1
In the first case, the RE as a whole is greedy because Y*
is greedy. It can match beginning at the Y
, and it matches
the longest possible string starting there, i.e., Y123
.
The output is the parenthesized part of that, or 123
.
In the second case, the RE as a whole is non-greedy because Y*?
is non-greedy. It can match beginning at the Y
, and it matches
the shortest possible string starting there, i.e., Y1
.
The subexpression [0-9]{1,3}
is greedy but it cannot change
the decision as to the overall match length; so it is forced to match
just 1
.
In short, when an RE contains both greedy and non-greedy subexpressions, the total match length is either as long as possible or as short as possible, according to the attribute assigned to the whole RE. The attributes assigned to the subexpressions only affect how much of that match they are allowed to “eat” relative to each other.
The quantifiers {1,1}
and {1,1}?
can be used to force greediness or non-greediness, respectively,
on a subexpression or a whole RE.
This is useful when you need the whole RE to have a greediness attribute
different from what's deduced from its elements. As an example,
suppose that we are trying to separate a string containing some digits
into the digits and the parts before and after them. We might try to
do that like this:
SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
Result: {abc0123,4,xyz}
That didn't work: the first .*
is greedy so
it “eats” as much as it can, leaving the \d+
to
match at the last possible place, the last digit. We might try to fix
that by making it non-greedy:
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
Result: {abc,0,""}
That didn't work either, because now the RE as a whole is non-greedy and so it ends the overall match as soon as possible. We can get what we want by forcing the RE as a whole to be greedy:
SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
Result: {abc,01234,xyz}
Controlling the RE's overall greediness separately from its components' greediness allows great flexibility in handling variable-length patterns.
When deciding what is a longer or shorter match,
match lengths are measured in characters, not collating elements.
An empty string is considered longer than no match at all.
For example:
bb*
matches the three middle characters of abbbc
;
(week|wee)(night|knights)
matches all ten characters of weeknights
;
when (.*).*
is matched against abc
the parenthesized subexpression
matches all three characters; and when
(a*)*
is matched against bc
both the whole RE and the parenthesized
subexpression match an empty string.
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
e.g., x
becomes [xX]
.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, e.g.,
[x]
becomes [xX]
and [^x]
becomes [^xX]
.
If newline-sensitive matching is specified, .
and bracket expressions using ^
will never match the newline character
(so that matches will not cross lines unless the RE
explicitly includes a newline)
and ^
and $
will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively.
But the ARE escapes \A
and \Z
continue to match beginning or end of string only.
Also, the character class shorthands \D
and \W
will match a newline regardless of this mode.
(Before PostgreSQL 14, they did not match
newlines when in newline-sensitive mode.
Write [^[:digit:]]
or [^[:word:]]
to get the old behavior.)
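For example, using the (?n) embedded option (or the n flag) one would expect ^ and $ to start matching at line boundaries:
SELECT E'foo\nbar' ~ '^bar$';
Result: f
SELECT E'foo\nbar' ~ '(?n)^bar$';
Result: t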
If partial newline-sensitive matching is specified,
this affects .
and bracket expressions
as with newline-sensitive matching, but not ^
and $
.
If inverse partial newline-sensitive matching is specified,
this affects ^
and $
as with newline-sensitive matching, but not .
and bracket expressions.
This isn't very useful but is provided for symmetry.
No particular limit is imposed on the length of REs in this implementation. However, programs intended to be highly portable should not employ REs longer than 256 bytes, as a POSIX-compliant implementation can refuse to accept such REs.
The only feature of AREs that is actually incompatible with
POSIX EREs is that \
does not lose its special
significance inside bracket expressions.
All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs;
the ***
syntax of directors likewise is outside the POSIX
syntax for both BREs and EREs.
Many of the ARE extensions are borrowed from Perl, but some have
been changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include \b
, \B
,
the lack of special treatment for a trailing newline,
the addition of complemented bracket expressions to the things
affected by newline-sensitive matching,
the restrictions on parentheses and back references in lookahead/lookbehind
constraints, and the longest/shortest-match (rather than first-match)
matching semantics.
BREs differ from EREs in several respects.
In BREs, |
, +
, and ?
are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are
\{
and \}
,
with {
and }
by themselves ordinary characters.
The parentheses for nested subexpressions are
\(
and \)
,
with (
and )
by themselves ordinary characters.
^
is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression,
$
is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression,
and *
is an ordinary character if it appears at the beginning
of the RE or the beginning of a parenthesized subexpression
(after a possible leading ^
).
Finally, single-digit back references are available, and
\<
and \>
are synonyms for
[[:<:]]
and [[:>:]]
respectively; no other escapes are available in BREs.
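For example, selecting BRE rules with the (?b) embedded option should make + an ordinary character rather than a quantifier:
SELECT 'a+b' ~ '(?b)a+b';
Result: t
SELECT 'aab' ~ '(?b)a+b';
Result: f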
Since SQL:2008, the SQL standard includes
a LIKE_REGEX
operator that performs pattern
matching according to the XQuery regular expression
standard. PostgreSQL does not yet
implement this operator, but you can get very similar behavior using
the regexp_match()
function, since XQuery
regular expressions are quite close to the ARE syntax described above.
Notable differences between the existing POSIX-based regular-expression feature and XQuery regular expressions include:
XQuery character class subtraction is not supported. An example of
this feature is using the following to match only English
consonants: [a-z-[aeiou]]
.
XQuery character class shorthands \c
,
\C
, \i
,
and \I
are not supported.
XQuery character class elements
using \p{UnicodeProperty}
or the
inverse \P{UnicodeProperty}
are not supported.
POSIX interprets character classes such as \w
(see Table 9.21)
according to the prevailing locale (which you can control by
attaching a COLLATE
clause to the operator or
function). XQuery specifies these classes by reference to Unicode
character properties, so equivalent behavior is obtained only with
a locale that follows the Unicode rules.
The SQL standard (not XQuery itself) attempts to cater for more
variants of “newline” than POSIX does. The
newline-sensitive matching options described above consider only
ASCII NL (\n
) to be a newline, but SQL would have
us treat CR (\r
), CRLF (\r\n
)
(a Windows-style newline), and some Unicode-only characters like
LINE SEPARATOR (U+2028) as newlines as well.
Notably, .
and \s
should
count \r\n
as one character not two according to
SQL.
Of the character-entry escapes described in
Table 9.20,
XQuery supports only \n
, \r
,
and \t
.
XQuery does not support the [:name:] syntax for character classes within bracket expressions.
XQuery does not have lookahead or lookbehind constraints, nor any of the constraint escapes described in Table 9.22.
The metasyntax forms described in Section 9.7.3.4 do not exist in XQuery.
The regular expression flag letters defined by XQuery are
related to but not the same as the option letters for POSIX
(Table 9.24). While the
i
and q
options behave the
same, others do not:
XQuery's s
(allow dot to match newline)
and m
(allow ^
and $
to match at newlines) flags provide
access to the same behaviors as
POSIX's n
, p
and w
flags, but they
do not match the behavior of
POSIX's s
and m
flags.
Note in particular that dot-matches-newline is the default
behavior in POSIX but not XQuery.
XQuery's x
(ignore whitespace in pattern) flag
is noticeably different from POSIX's expanded-mode flag.
POSIX's x
flag also
allows #
to begin a comment in the pattern,
and POSIX will not ignore a whitespace character after a
backslash.
The PostgreSQL formatting functions provide a powerful set of tools for converting various data types (date/time, integer, floating point, numeric) to formatted strings and for converting from formatted strings to specific data types. Table 9.25 lists them. These functions all follow a common calling convention: the first argument is the value to be formatted and the second argument is a template that defines the output or input format.
Table 9.25. Formatting Functions
Function | Description |
---|---|
to_char ( timestamp, text ) → text | Converts time stamp to string according to the given format. |
to_char ( interval, text ) → text | Converts interval to string according to the given format. |
to_char ( numeric_type, text ) → text | Converts number to string according to the given format; available for integer, bigint, numeric, real, double precision. |
to_date ( text, text ) → date | Converts string to date according to the given format. |
to_number ( text, text ) → numeric | Converts string to numeric according to the given format. |
to_timestamp ( text, text ) → timestamp with time zone | Converts string to time stamp according to the given format. (See also to_timestamp(double precision) in Table 9.32.) |
to_timestamp
and to_date
exist to handle input formats that cannot be converted by
simple casting. For most standard date/time formats, simply casting the
source string to the required data type works, and is much easier.
Similarly, to_number
is unnecessary for standard numeric
representations.
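For example, a standard ISO date string can simply be cast, while a non-standard layout needs a template:
SELECT '2000-12-05'::date;
Result: 2000-12-05
SELECT to_date('05 Dec 2000', 'DD Mon YYYY');
Result: 2000-12-05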
In a to_char
output template string, there are certain
patterns that are recognized and replaced with appropriately-formatted
data based on the given value. Any text that is not a template pattern is
simply copied verbatim. Similarly, in an input template string (for the
other functions), template patterns identify the values to be supplied by
the input data string. If there are characters in the template string
that are not template patterns, the corresponding characters in the input
data string are simply skipped over (whether or not they are equal to the
template string characters).
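For example:
SELECT to_char(timestamp '2001-02-16 20:38:40', 'YYYY-MM-DD HH24:MI');
Result: 2001-02-16 20:38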
Table 9.26 shows the template patterns available for formatting date and time values.
Table 9.26. Template Patterns for Date/Time Formatting
Pattern | Description |
---|---|
HH | hour of day (01–12) |
HH12 | hour of day (01–12) |
HH24 | hour of day (00–23) |
MI | minute (00–59) |
SS | second (00–59) |
MS | millisecond (000–999) |
US | microsecond (000000–999999) |
FF1 | tenth of second (0–9) |
FF2 | hundredth of second (00–99) |
FF3 | millisecond (000–999) |
FF4 | tenth of a millisecond (0000–9999) |
FF5 | hundredth of a millisecond (00000–99999) |
FF6 | microsecond (000000–999999) |
SSSS , SSSSS | seconds past midnight (0–86399) |
AM , am , PM or pm | meridiem indicator (without periods) |
A.M. , a.m. , P.M. or p.m. | meridiem indicator (with periods) |
Y,YYY | year (4 or more digits) with comma |
YYYY | year (4 or more digits) |
YYY | last 3 digits of year |
YY | last 2 digits of year |
Y | last digit of year |
IYYY | ISO 8601 week-numbering year (4 or more digits) |
IYY | last 3 digits of ISO 8601 week-numbering year |
IY | last 2 digits of ISO 8601 week-numbering year |
I | last digit of ISO 8601 week-numbering year |
BC , bc , AD or ad | era indicator (without periods) |
B.C. , b.c. , A.D. or a.d. | era indicator (with periods) |
MONTH | full upper case month name (blank-padded to 9 chars) |
Month | full capitalized month name (blank-padded to 9 chars) |
month | full lower case month name (blank-padded to 9 chars) |
MON | abbreviated upper case month name (3 chars in English, localized lengths vary) |
Mon | abbreviated capitalized month name (3 chars in English, localized lengths vary) |
mon | abbreviated lower case month name (3 chars in English, localized lengths vary) |
MM | month number (01–12) |
DAY | full upper case day name (blank-padded to 9 chars) |
Day | full capitalized day name (blank-padded to 9 chars) |
day | full lower case day name (blank-padded to 9 chars) |
DY | abbreviated upper case day name (3 chars in English, localized lengths vary) |
Dy | abbreviated capitalized day name (3 chars in English, localized lengths vary) |
dy | abbreviated lower case day name (3 chars in English, localized lengths vary) |
DDD | day of year (001–366) |
IDDD | day of ISO 8601 week-numbering year (001–371; day 1 of the year is Monday of the first ISO week) |
DD | day of month (01–31) |
D | day of the week, Sunday (1 ) to Saturday (7 ) |
ID | ISO 8601 day of the week, Monday (1 ) to Sunday (7 ) |
W | week of month (1–5) (the first week starts on the first day of the month) |
WW | week number of year (1–53) (the first week starts on the first day of the year) |
IW | week number of ISO 8601 week-numbering year (01–53; the first Thursday of the year is in week 1) |
CC | century (2 digits) (the twenty-first century starts on 2001-01-01) |
J | Julian Date (integer days since November 24, 4714 BC at local midnight; see Section B.7) |
Q | quarter |
RM | month in upper case Roman numerals (I–XII; I=January) |
rm | month in lower case Roman numerals (i–xii; i=January) |
TZ | upper case time-zone abbreviation (only supported in to_char ) |
tz | lower case time-zone abbreviation (only supported in to_char ) |
TZH | time-zone hours |
TZM | time-zone minutes |
OF | time-zone offset from UTC (only supported in to_char ) |
Modifiers can be applied to any template pattern to alter its
behavior. For example, FMMonth
is the Month
pattern with the
FM
modifier.
Table 9.27 shows the
modifier patterns for date/time formatting.
Table 9.27. Template Pattern Modifiers for Date/Time Formatting
Modifier | Description | Example |
---|---|---|
FM prefix | fill mode (suppress leading zeroes and padding blanks) | FMMonth |
TH suffix | upper case ordinal number suffix | DDTH , e.g., 12TH |
th suffix | lower case ordinal number suffix | DDth , e.g., 12th |
FX prefix | fixed format global option (see usage notes) | FX Month DD Day |
TM prefix | translation mode (use localized day and month names based on lc_time) | TMMonth |
SP suffix | spell mode (not implemented) | DDSP |
Usage notes for date/time formatting:
FM
suppresses leading zeroes and trailing blanks
that would otherwise be added to make the output of a pattern be
fixed-width. In PostgreSQL,
FM
modifies only the next specification, while in
Oracle FM
affects all subsequent
specifications, and repeated FM
modifiers
toggle fill mode on and off.
TM
suppresses trailing blanks whether or
not FM
is specified.
to_timestamp
and to_date
ignore letter case in the input; so for
example MON
, Mon
,
and mon
all accept the same strings. When using
the TM
modifier, case-folding is done according to
the rules of the function's input collation (see
Section 24.2).
to_timestamp
and to_date
skip multiple blank spaces at the beginning of the input string and
around date and time values unless the FX
option is used. For example, to_timestamp('  2000    JUN', 'YYYY MON') and to_timestamp('2000 - JUN', 'YYYY-MON') work, but to_timestamp('2000    JUN', 'FXYYYY MON') returns an error because to_timestamp expects only a single space.
FX
must be specified as the first item in
the template.
A separator (a space or non-letter/non-digit character) in the template string of
to_timestamp
and to_date
matches any single separator in the input string or is skipped,
unless the FX
option is used.
For example, to_timestamp('2000JUN', 'YYYY///MON')
and
to_timestamp('2000/JUN', 'YYYY MON')
work, but
to_timestamp('2000//JUN', 'YYYY/MON')
returns an error because the number of separators in the input string
exceeds the number of separators in the template.
If FX
is specified, a separator in the template string
matches exactly one character in the input string. But note that the
input string character is not required to be the same as the separator from the template string.
For example, to_timestamp('2000/JUN', 'FXYYYY MON') works, but to_timestamp('2000/JUN', 'FXYYYY  MON') (note the two spaces in the template) returns an error because the second space in the template string consumes the letter J from the input string.
A TZH
template pattern can match a signed number.
Without the FX
option, minus signs may be ambiguous,
and could be interpreted as a separator.
This ambiguity is resolved as follows: If the number of separators before
TZH
in the template string is less than the number of
separators before the minus sign in the input string, the minus sign
is interpreted as part of TZH
.
Otherwise, the minus sign is considered to be a separator between values.
For example, to_timestamp('2000 -10', 'YYYY TZH') matches -10 to TZH , but to_timestamp('2000 -10', 'YYYY  TZH') (note the two spaces before TZH ) matches 10 to TZH .
Ordinary text is allowed in to_char
templates and will be output literally. You can put a substring
in double quotes to force it to be interpreted as literal text
even if it contains template patterns. For example, in
'"Hello Year "YYYY'
, the YYYY
will be replaced by the year data, but the single Y
in Year
will not be.
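For example:
SELECT to_char(timestamp '2001-02-16 20:38:40', '"Hello Year "YYYY');
Result: Hello Year 2001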
In to_date
, to_number
,
and to_timestamp
, literal text and double-quoted
strings result in skipping the number of characters contained in the
string; for example "XX"
skips two input characters
(whether or not they are XX
).
Prior to PostgreSQL 12, it was possible to
skip arbitrary text in the input string using non-letter or non-digit
characters. For example,
to_timestamp('2000y6m1d', 'yyyy-MM-DD')
used to
work. Now you can only use letter characters for this purpose. For example,
to_timestamp('2000y6m1d', 'yyyytMMtDDt')
and
to_timestamp('2000y6m1d', 'yyyy"y"MM"m"DD"d"')
skip y
, m
, and
d
.
If you want to have a double quote in the output you must
precede it with a backslash, for example '\"YYYY
Month\"'
.
Backslashes are not otherwise special outside of double-quoted
strings. Within a double-quoted string, a backslash causes the
next character to be taken literally, whatever it is (but this
has no special effect unless the next character is a double quote
or another backslash).
In to_timestamp
and to_date
,
if the year format specification is less than four digits, e.g.,
YYY
, and the supplied year is less than four digits,
the year will be adjusted to be nearest to the year 2020, e.g.,
95
becomes 1995.
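For example, one would expect the following, with the unspecified month and day defaulting to January 1:
SELECT to_date('95', 'YY');
Result: 1995-01-01
SELECT to_date('25', 'YY');
Result: 2025-01-01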
In to_timestamp
and to_date
,
negative years are treated as signifying BC. If you write both a
negative year and an explicit BC
field, you get AD
again. An input of year zero is treated as 1 BC.
In to_timestamp
and to_date
,
the YYYY
conversion has a restriction when
processing years with more than 4 digits. You must
use some non-digit character or template after YYYY
,
otherwise the year is always interpreted as 4 digits. For example
(with the year 20000):
to_date('200001130', 'YYYYMMDD')
will be
interpreted as a 4-digit year; instead use a non-digit
separator after the year, like
to_date('20000-1130', 'YYYY-MMDD')
or
to_date('20000Nov30', 'YYYYMonDD')
.
In to_timestamp
and to_date
,
the CC
(century) field is accepted but ignored
if there is a YYY
, YYYY
or
Y,YYY
field. If CC
is used with
YY
or Y
then the result is
computed as that year in the specified century. If the century is
specified but the year is not, the first year of the century
is assumed.
In to_timestamp
and to_date
,
weekday names or numbers (DAY
, D
,
and related field types) are accepted but are ignored for purposes of
computing the result. The same is true for quarter
(Q
) fields.
In to_timestamp
and to_date
,
an ISO 8601 week-numbering date (as distinct from a Gregorian date)
can be specified in one of two ways:
Year, week number, and weekday: for
example to_date('2006-42-4', 'IYYY-IW-ID')
returns the date 2006-10-19
.
If you omit the weekday it is assumed to be 1 (Monday).
Year and day of year: for example to_date('2006-291',
'IYYY-IDDD')
also returns 2006-10-19
.
Attempting to enter a date using a mixture of ISO 8601 week-numbering fields and Gregorian date fields is nonsensical, and will cause an error. In the context of an ISO 8601 week-numbering year, the concept of a “month” or “day of month” has no meaning. In the context of a Gregorian year, the ISO week has no meaning.
While to_date
will reject a mixture of
Gregorian and ISO week-numbering date
fields, to_char
will not, since output format
specifications like YYYY-MM-DD (IYYY-IDDD)
can be
useful. But avoid writing something like IYYY-MM-DD
;
that would yield surprising results near the start of the year.
(See Section 9.9.1 for more
information.)
In to_timestamp
, millisecond
(MS
) or microsecond (US
)
fields are used as the
seconds digits after the decimal point. For example
to_timestamp('12.3', 'SS.MS')
is not 3 milliseconds,
but 300, because the conversion treats it as 12 + 0.3 seconds.
So, for the format SS.MS
, the input values
12.3
, 12.30
,
and 12.300
specify the
same number of milliseconds. To get three milliseconds, one must write
12.003
, which the conversion treats as
12 + 0.003 = 12.003 seconds.
Here is a more
complex example:
to_timestamp('15:12:02.020.001230', 'HH24:MI:SS.MS.US')
is 15 hours, 12 minutes, and 2 seconds + 20 milliseconds +
1230 microseconds = 2.021230 seconds.
to_char(..., 'ID')
's day of the week numbering
matches the extract(isodow from ...)
function, but
to_char(..., 'D')
's does not match
extract(dow from ...)
's day numbering.
to_char(interval)
formats HH
and
HH12
as shown on a 12-hour clock, for example zero hours
and 36 hours both output as 12
, while HH24
outputs the full hour value, which can exceed 23 in
an interval
value.
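For example:
SELECT to_char(interval '36 hours', 'HH12:MI:SS');
Result: 12:00:00
SELECT to_char(interval '36 hours', 'HH24:MI:SS');
Result: 36:00:00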
Table 9.28 shows the template patterns available for formatting numeric values.
Table 9.28. Template Patterns for Numeric Formatting
Pattern | Description |
---|---|
9 | digit position (can be dropped if insignificant) |
0 | digit position (will not be dropped, even if insignificant) |
. (period) | decimal point |
, (comma) | group (thousands) separator |
PR | negative value in angle brackets |
S | sign anchored to number (uses locale) |
L | currency symbol (uses locale) |
D | decimal point (uses locale) |
G | group separator (uses locale) |
MI | minus sign in specified position (if number < 0) |
PL | plus sign in specified position (if number > 0) |
SG | plus/minus sign in specified position |
RN | Roman numeral (input between 1 and 3999) |
TH or th | ordinal number suffix |
V | shift specified number of digits (see notes) |
EEEE | exponent for scientific notation |
Usage notes for numeric formatting:
0
specifies a digit position that will always be printed,
even if it contains a leading/trailing zero. 9
also
specifies a digit position, but if it is a leading zero then it will
be replaced by a space, while if it is a trailing zero and fill mode
is specified then it will be deleted. (For to_number()
,
these two pattern characters are equivalent.)
If the format provides fewer fractional digits than the number being
formatted, to_char()
will round the number to
the specified number of fractional digits.
The pattern characters S
, L
, D
,
and G
represent the sign, currency symbol, decimal point,
and thousands separator characters defined by the current locale
(see lc_monetary
and lc_numeric). The pattern characters period
and comma represent those exact characters, with the meanings of
decimal point and thousands separator, regardless of locale.
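For example, using the locale-independent comma and period patterns (the result is shown in quotes because one column is reserved for the sign, producing a leading space):
SELECT to_char(1234.5, '9,999.99');
Result: ' 1,234.50'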
If no explicit provision is made for a sign
in to_char()
's pattern, one column will be reserved for
the sign, and it will be anchored to (appear just left of) the
number. If S
appears just left of some 9
's,
it will likewise be anchored to the number.
A sign formatted using SG
, PL
, or
MI
is not anchored to
the number; for example,
to_char(-12, 'MI9999')
produces '- 12'
but to_char(-12, 'S9999')
produces ' -12'
.
(The Oracle implementation does not allow the use of
MI
before 9
, but rather
requires that 9
precede
MI
.)
TH
does not convert values less than zero
and does not convert fractional numbers.
PL
, SG
, and
TH
are PostgreSQL
extensions.
In to_number
, if non-data template patterns such
as L
or TH
are used, the
corresponding number of input characters are skipped, whether or not
they match the template pattern, unless they are data characters
(that is, digits, sign, decimal point, or comma). For
example, TH
would skip two non-data characters.
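For example, in a locale where the comma is the group separator and the period is the decimal point, one would expect:
SELECT to_number('12,454.8-', '99G999D9S');   -- interpretation of G and D depends on lc_numeric
Result: -12454.8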
V with to_char multiplies the input values by 10^n, where n is the number of digits following V. V with to_number divides in a similar manner.
to_char
and to_number
do not support the use of
V
combined with a decimal point
(e.g., 99.9V99
is not allowed).
EEEE
(scientific notation) cannot be used in
combination with any of the other formatting patterns or
modifiers other than digit and decimal point patterns, and must be at the end of the format string
(e.g., 9.99EEEE
is a valid pattern).
Certain modifiers can be applied to any template pattern to alter its
behavior. For example, FM99.99
is the 99.99
pattern with the
FM
modifier.
Table 9.29 shows the
modifier patterns for numeric formatting.
Table 9.29. Template Pattern Modifiers for Numeric Formatting
Modifier | Description | Example |
---|---|---|
FM prefix | fill mode (suppress trailing zeroes and padding blanks) | FM99.99 |
TH suffix | upper case ordinal number suffix | 999TH |
th suffix | lower case ordinal number suffix | 999th |
Table 9.30 shows some
examples of the use of the to_char
function.
Table 9.30. to_char
Examples
Expression | Result |
---|---|
to_char(current_timestamp, 'Day, DD HH12:MI:SS') | 'Tuesday , 06 05:39:18' |
to_char(current_timestamp, 'FMDay, FMDD HH12:MI:SS') | 'Tuesday, 6 05:39:18' |
to_char(-0.1, '99.99') | ' -.10' |
to_char(-0.1, 'FM9.99') | '-.1' |
to_char(-0.1, 'FM90.99') | '-0.1' |
to_char(0.1, '0.9') | ' 0.1' |
to_char(12, '9990999.9') | ' 0012.0' |
to_char(12, 'FM9990999.9') | '0012.' |
to_char(485, '999') | ' 485' |
to_char(-485, '999') | '-485' |
to_char(485, '9 9 9') | ' 4 8 5' |
to_char(1485, '9,999') | ' 1,485' |
to_char(1485, '9G999') | ' 1 485' |
to_char(148.5, '999.999') | ' 148.500' |
to_char(148.5, 'FM999.999') | '148.5' |
to_char(148.5, 'FM999.990') | '148.500' |
to_char(148.5, '999D999') | ' 148,500' |
to_char(3148.5, '9G999D999') | ' 3 148,500' |
to_char(-485, '999S') | '485-' |
to_char(-485, '999MI') | '485-' |
to_char(485, '999MI') | '485 ' |
to_char(485, 'FM999MI') | '485' |
to_char(485, 'PL999') | '+485' |
to_char(485, 'SG999') | '+485' |
to_char(-485, 'SG999') | '-485' |
to_char(-485, '9SG99') | '4-85' |
to_char(-485, '999PR') | '<485>' |
to_char(485, 'L999') | 'DM 485' |
to_char(485, 'RN') | ' CDLXXXV' |
to_char(485, 'FMRN') | 'CDLXXXV' |
to_char(5.2, 'FMRN') | 'V' |
to_char(482, '999th') | ' 482nd' |
to_char(485, '"Good number:"999') | 'Good number: 485' |
to_char(485.8, '"Pre:"999" Post:" .999') | 'Pre: 485 Post: .800' |
to_char(12, '99V999') | ' 12000' |
to_char(12.4, '99V999') | ' 12400' |
to_char(12.45, '99V9') | ' 125' |
to_char(0.0004859, '9.99EEEE') | ' 4.86e-04' |
Table 9.32 shows the available
functions for date/time value processing, with details appearing in
the following subsections. Table 9.31 illustrates the behaviors of
the basic arithmetic operators (+
,
*
, etc.). For formatting functions, refer to
Section 9.8. You should be familiar with
the background information on date/time data types from Section 8.5.
In addition, the usual comparison operators shown in
Table 9.1 are available for the
date/time types. Dates and timestamps (with or without time zone) are
all comparable, while times (with or without time zone) and intervals
can only be compared to other values of the same data type. When
comparing a timestamp without time zone to a timestamp with time zone,
the former value is assumed to be given in the time zone specified by
the TimeZone configuration parameter, and is
rotated to UTC for comparison to the latter value (which is already
in UTC internally). Similarly, a date value is assumed to represent
midnight in the TimeZone
zone when comparing it
to a timestamp.
All the functions and operators described below that take time
or timestamp
inputs actually come in two variants: one that takes time with time zone
or timestamp
with time zone
, and one that takes time without time zone
or timestamp without time zone
.
For brevity, these variants are not shown separately. Also, the
+
and *
operators come in commutative pairs (for
example both date
+
integer
and integer
+
date
); we show
only one of each such pair.
Table 9.31. Date/Time Operators
Operator | Description |
---|---|
date + integer → date | Add a number of days to a date |
date + interval → timestamp | Add an interval to a date |
date + time → timestamp | Add a time-of-day to a date |
interval + interval → interval | Add intervals |
timestamp + interval → timestamp | Add an interval to a timestamp |
time + interval → time | Add an interval to a time |
- interval → interval | Negate an interval |
date - date → integer | Subtract dates, producing the number of days elapsed |
date - integer → date | Subtract a number of days from a date |
date - interval → timestamp | Subtract an interval from a date |
time - time → interval | Subtract times |
time - interval → time | Subtract an interval from a time |
timestamp - interval → timestamp | Subtract an interval from a timestamp |
interval - interval → interval | Subtract intervals |
timestamp - timestamp → interval | Subtract timestamps (converting 24-hour intervals into days, similarly to justify_hours() ) |
interval * double precision → interval | Multiply an interval by a scalar |
interval / double precision → interval | Divide an interval by a scalar |
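A few representative examples of these operators (interval results are shown in the default output format):
SELECT date '2001-09-28' + 7;
Result: 2001-10-05
SELECT date '2001-09-28' + interval '1 hour';
Result: 2001-09-28 01:00:00
SELECT interval '1 day' * 21;
Result: 21 days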
Table 9.32. Date/Time Functions
Function Description Example(s) |
---|
Subtract arguments, producing a “symbolic” result that uses years and months, rather than just days
|
Subtract argument from
|
Current date and time (changes during statement execution); see Section 9.9.5
|
Current date; see Section 9.9.5
|
Current time of day; see Section 9.9.5
|
Current time of day, with limited precision; see Section 9.9.5
|
Current date and time (start of current transaction); see Section 9.9.5
|
Current date and time (start of current transaction), with limited precision; see Section 9.9.5
|
Bin input into specified interval aligned with specified origin; see Section 9.9.3
|
Get timestamp subfield (equivalent to
|
Get interval subfield (equivalent to
|
Truncate to specified precision; see Section 9.9.2
|
Truncate to specified precision in the specified time zone; see Section 9.9.2
|
Truncate to specified precision; see Section 9.9.2
|
Get timestamp subfield; see Section 9.9.1
|
Get interval subfield; see Section 9.9.1
|
Test for finite date (not +/-infinity)
|
Test for finite timestamp (not +/-infinity)
|
Test for finite interval (currently always true)
|
Adjust interval, converting 30-day time periods to months
|
Adjust interval, converting 24-hour time periods to days
|
Adjust interval using
|
Current time of day; see Section 9.9.5
|
Current time of day, with limited precision; see Section 9.9.5
|
Current date and time (start of current transaction); see Section 9.9.5
|
Current date and time (start of current transaction), with limited precision; see Section 9.9.5
|
Create date from year, month and day fields (negative years signify BC)
|
Create interval from years, months, weeks, days, hours, minutes and seconds fields, each of which can default to zero
|
Create time from hour, minute and seconds fields
|
Create timestamp from year, month, day, hour, minute and seconds fields (negative years signify BC)
|
Create timestamp with time zone from year, month, day, hour, minute
and seconds fields (negative years signify BC).
If
|
Current date and time (start of current transaction); see Section 9.9.5
|
Current date and time (start of current statement); see Section 9.9.5
|
Current date and time
(like
|
Current date and time (start of current transaction); see Section 9.9.5
|
Convert Unix epoch (seconds since 1970-01-01 00:00:00+00) to timestamp with time zone
|
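A few representative examples of these functions, with the results one would expect under default settings:
SELECT age(timestamp '2001-04-10', timestamp '1957-06-13');
Result: 43 years 9 mons 27 days
SELECT date_trunc('hour', timestamp '2001-02-16 20:38:40');
Result: 2001-02-16 20:00:00
SELECT make_date(2013, 7, 15);
Result: 2013-07-15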
In addition to these functions, the SQL OVERLAPS
operator is
supported:
(start1, end1) OVERLAPS (start2, end2)
(start1, length1) OVERLAPS (start2, length2)
This expression yields true when two time periods (defined by their
endpoints) overlap, false when they do not overlap. The endpoints
can be specified as pairs of dates, times, or time stamps; or as
a date, time, or time stamp followed by an interval. When a pair
of values is provided, either the start or the end can be written
first; OVERLAPS
automatically takes the earlier value
of the pair as the start. Each time period is considered to
represent the half-open interval start
<=
time
<
end
, unless
start
and end
are equal in which case it
represents that single time instant. This means for instance that two
time periods with only an endpoint in common do not overlap.
SELECT (DATE '2001-02-16', DATE '2001-12-21') OVERLAPS (DATE '2001-10-30', DATE '2002-10-30'); Result:true
SELECT (DATE '2001-02-16', INTERVAL '100 days') OVERLAPS (DATE '2001-10-30', DATE '2002-10-30'); Result:false
SELECT (DATE '2001-10-29', DATE '2001-10-30') OVERLAPS (DATE '2001-10-30', DATE '2001-10-31'); Result:false
SELECT (DATE '2001-10-30', DATE '2001-10-30') OVERLAPS (DATE '2001-10-30', DATE '2001-10-31'); Result:true
When adding an interval
value to (or subtracting an
interval
value from) a timestamp with time zone
value, the days component advances or decrements the date of the
timestamp with time zone
by the indicated number of days,
keeping the time of day the same.
Across daylight saving time changes (when the session time zone is set to a
time zone that recognizes DST), this means interval '1 day'
does not necessarily equal interval '24 hours'
.
For example, with the session time zone set
to America/Denver
:
SELECT timestamp with time zone '2005-04-02 12:00:00-07' + interval '1 day'; Result:2005-04-03 12:00:00-06
SELECT timestamp with time zone '2005-04-02 12:00:00-07' + interval '24 hours'; Result:2005-04-03 13:00:00-06
This happens because an hour was skipped due to a change in daylight saving
time at 2005-04-03 02:00:00
in time zone
America/Denver
.
Note there can be ambiguity in the months
field returned by
age
because different months have different numbers of
days. PostgreSQL's approach uses the month from the
earlier of the two dates when calculating partial months. For example,
age('2004-06-01', '2004-04-30')
uses April to yield
1 mon 1 day
, while using May would yield 1 mon 2
days
because May has 31 days, while April has only 30.
Subtraction of dates and timestamps can also be complex. One conceptually
simple way to perform subtraction is to convert each value to a number
of seconds using EXTRACT(EPOCH FROM ...)
, then subtract the
results; this produces the
number of seconds between the two values. This will adjust
for the number of days in each month, timezone changes, and daylight
saving time adjustments. Subtraction of date or timestamp
values with the “-
” operator
returns the number of days (24-hours) and hours/minutes/seconds
between the values, making the same adjustments. The age
function returns years, months, days, and hours/minutes/seconds,
performing field-by-field subtraction and then adjusting for negative
field values. The following queries illustrate the differences in these
approaches. The sample results were produced with timezone
= 'US/Eastern'
; there is a daylight saving time change between the
two dates used:
SELECT EXTRACT(EPOCH FROM timestamptz '2013-07-01 12:00:00') - EXTRACT(EPOCH FROM timestamptz '2013-03-01 12:00:00'); Result:10537200.000000
SELECT (EXTRACT(EPOCH FROM timestamptz '2013-07-01 12:00:00') - EXTRACT(EPOCH FROM timestamptz '2013-03-01 12:00:00')) / 60 / 60 / 24; Result:121.9583333333333333
SELECT timestamptz '2013-07-01 12:00:00' - timestamptz '2013-03-01 12:00:00'; Result:121 days 23:00:00
SELECT age(timestamptz '2013-07-01 12:00:00', timestamptz '2013-03-01 12:00:00'); Result:4 mons
EXTRACT, date_part
EXTRACT(field FROM source)
The extract
function retrieves subfields
such as year or hour from date/time values.
source
must be a value expression of
type timestamp
, date
, time
,
or interval
. (Timestamps and times can be with or
without time zone.)
field
is an identifier or
string that selects what field to extract from the source value.
Not all fields are valid for every input data type; for example, fields
smaller than a day cannot be extracted from a date
, while
fields of a day or more cannot be extracted from a time
.
The extract
function returns values of type
numeric
.
The following are valid field names:
century
The century; for interval
values, the year field
divided by 100
SELECT EXTRACT(CENTURY FROM TIMESTAMP '2000-12-16 12:21:13'); Result:20
SELECT EXTRACT(CENTURY FROM TIMESTAMP '2001-02-16 20:38:40'); Result:21
SELECT EXTRACT(CENTURY FROM DATE '0001-01-01 AD'); Result:1
SELECT EXTRACT(CENTURY FROM DATE '0001-12-31 BC'); Result:-1
SELECT EXTRACT(CENTURY FROM INTERVAL '2001 years'); Result:20
day
The day of the month (1–31); for interval
values, the number of days
SELECT EXTRACT(DAY FROM TIMESTAMP '2001-02-16 20:38:40'); Result:16
SELECT EXTRACT(DAY FROM INTERVAL '40 days 1 minute'); Result:40
decade
The year field divided by 10
SELECT EXTRACT(DECADE FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 200
dow
The day of the week as Sunday (0
) to
Saturday (6
)
SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 5
Note that extract
's day of the week numbering
differs from that of the to_char(...,
'D')
function.
doy
The day of the year (1–365/366)
SELECT EXTRACT(DOY FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 47
epoch
For timestamp with time zone
values, the
number of seconds since 1970-01-01 00:00:00 UTC (negative for
timestamps before that);
for date
and timestamp
values, the
nominal number of seconds since 1970-01-01 00:00:00,
without regard to timezone or daylight-savings rules;
for interval
values, the total number
of seconds in the interval
SELECT EXTRACT(EPOCH FROM TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40.12-08'); Result:982384720.120000
SELECT EXTRACT(EPOCH FROM TIMESTAMP '2001-02-16 20:38:40.12'); Result:982355920.120000
SELECT EXTRACT(EPOCH FROM INTERVAL '5 days 3 hours'); Result:442800.000000
You can convert an epoch value back to a timestamp with time zone
with to_timestamp
:
SELECT to_timestamp(982384720.12);
Result: 2001-02-17 04:38:40.12+00
Beware that applying to_timestamp
to an epoch
extracted from a date
or timestamp
value
could produce a misleading result: the result will effectively
assume that the original value had been given in UTC, which might
not be the case.
hour
The hour field (0–23 in timestamps, unrestricted in intervals)
SELECT EXTRACT(HOUR FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 20
isodow
The day of the week as Monday (1
) to
Sunday (7
)
SELECT EXTRACT(ISODOW FROM TIMESTAMP '2001-02-18 20:38:40');
Result: 7
This is identical to dow
except for Sunday. This
matches the ISO 8601 day of the week numbering.
isoyear
The ISO 8601 week-numbering year that the date falls in
SELECT EXTRACT(ISOYEAR FROM DATE '2006-01-01'); Result:2005
SELECT EXTRACT(ISOYEAR FROM DATE '2006-01-02'); Result:2006
Each ISO 8601 week-numbering year begins with the
Monday of the week containing the 4th of January, so in early
January or late December the ISO year may be
different from the Gregorian year. See the week
field for more information.
julian
The Julian Date corresponding to the date or timestamp. Timestamps that are not local midnight result in a fractional value. See Section B.7 for more information.
SELECT EXTRACT(JULIAN FROM DATE '2006-01-01'); Result:2453737
SELECT EXTRACT(JULIAN FROM TIMESTAMP '2006-01-01 12:00'); Result:2453737.50000000000000000000
microseconds
The seconds field, including fractional parts, multiplied by 1 000 000; note that this includes full seconds
SELECT EXTRACT(MICROSECONDS FROM TIME '17:12:28.5');
Result: 28500000
millennium
The millennium; for interval values, the year field divided by 1000
SELECT EXTRACT(MILLENNIUM FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 3
SELECT EXTRACT(MILLENNIUM FROM INTERVAL '2001 years');
Result: 2
Years in the 1900s are in the second millennium. The third millennium started January 1, 2001.
milliseconds
The seconds field, including fractional parts, multiplied by 1000. Note that this includes full seconds.
SELECT EXTRACT(MILLISECONDS FROM TIME '17:12:28.5');
Result: 28500.000
minute
The minutes field (0–59)
SELECT EXTRACT(MINUTE FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 38
month
The number of the month within the year (1–12); for interval values, the number of months modulo 12 (0–11)
SELECT EXTRACT(MONTH FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 2
SELECT EXTRACT(MONTH FROM INTERVAL '2 years 3 months');
Result: 3
SELECT EXTRACT(MONTH FROM INTERVAL '2 years 13 months');
Result: 1
quarter
The quarter of the year (1–4) that the date is in
SELECT EXTRACT(QUARTER FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 1
second
The seconds field, including any fractional seconds
SELECT EXTRACT(SECOND FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 40.000000
SELECT EXTRACT(SECOND FROM TIME '17:12:28.5');
Result: 28.500000
timezone
The time zone offset from UTC, measured in seconds. Positive values correspond to time zones east of UTC, negative values to zones west of UTC. (Technically, PostgreSQL does not use UTC because leap seconds are not handled.)
timezone_hour
The hour component of the time zone offset
timezone_minute
The minute component of the time zone offset
week
The number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year. In other words, the first Thursday of a year is in week 1 of that year.
In the ISO week-numbering system, it is possible for early-January dates to be part of the 52nd or 53rd week of the previous year, and for late-December dates to be part of the first week of the next year. For example, 2005-01-01 is part of the 53rd week of year 2004, and 2006-01-01 is part of the 52nd week of year 2005, while 2012-12-31 is part of the first week of 2013. It's recommended to use the isoyear field together with week to get consistent results.
SELECT EXTRACT(WEEK FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 7
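As a further illustration of the year-boundary behavior described above (the values follow from the example dates discussed in the text):
SELECT EXTRACT(WEEK FROM DATE '2005-01-01');
Result: 53
SELECT EXTRACT(ISOYEAR FROM DATE '2005-01-01');
Result: 2004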
year
The year field. Keep in mind there is no 0 AD, so subtracting BC years from AD years should be done with care.
SELECT EXTRACT(YEAR FROM TIMESTAMP '2001-02-16 20:38:40');
Result: 2001
When processing an interval value, the extract function produces field values that match the interpretation used by the interval output function. This can produce surprising results if one starts with a non-normalized interval representation, for example:
SELECT INTERVAL '80 minutes';
Result: 01:20:00
SELECT EXTRACT(MINUTES FROM INTERVAL '80 minutes');
Result: 20
When the input value is +/-Infinity, extract returns +/-Infinity for monotonically-increasing fields (epoch, julian, year, isoyear, decade, century, and millennium). For other fields, NULL is returned. PostgreSQL versions before 9.6 returned zero for all cases of infinite input.
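A brief illustration of this rule (the NULL result is displayed as an empty string in psql by default):
SELECT EXTRACT(YEAR FROM TIMESTAMP 'infinity');
Result: Infinity
SELECT EXTRACT(MONTH FROM TIMESTAMP 'infinity');
Result: NULL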
The extract function is primarily intended for computational processing. For formatting date/time values for display, see Section 9.8.
The date_part function is modeled on the traditional Ingres equivalent to the SQL-standard function extract:
date_part('field', source)
Note that here the field parameter needs to be a string value, not a name. The valid field names for date_part are the same as for extract.
For historical reasons, the date_part function returns values of type double precision. This can result in a loss of precision in certain uses. Using extract is recommended instead.
SELECT date_part('day', TIMESTAMP '2001-02-16 20:38:40');
Result: 16
SELECT date_part('hour', INTERVAL '4 hours 3 minutes');
Result: 4
date_trunc
The function date_trunc is conceptually similar to the trunc function for numbers.
date_trunc(field, source [, time_zone ])
source is a value expression of type timestamp, timestamp with time zone, or interval. (Values of type date and time are cast automatically to timestamp or interval, respectively.) field selects to which precision to truncate the input value. The return value is likewise of type timestamp, timestamp with time zone, or interval, and it has all fields that are less significant than the selected one set to zero (or one, for day and month).
Valid values for field are:
microseconds
milliseconds
second
minute
hour
day
week
month
quarter
year
decade
century
millennium
When the input value is of type timestamp with time zone, the truncation is performed with respect to a particular time zone; for example, truncation to day produces a value that is midnight in that zone. By default, truncation is done with respect to the current TimeZone setting, but the optional time_zone argument can be provided to specify a different time zone. The time zone name can be specified in any of the ways described in Section 8.5.3.
A time zone cannot be specified when processing timestamp without time zone or interval inputs. These are always taken at face value.
Examples (assuming the local time zone is America/New_York):
SELECT date_trunc('hour', TIMESTAMP '2001-02-16 20:38:40');
Result: 2001-02-16 20:00:00
SELECT date_trunc('year', TIMESTAMP '2001-02-16 20:38:40');
Result: 2001-01-01 00:00:00
SELECT date_trunc('day', TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40+00');
Result: 2001-02-16 00:00:00-05
SELECT date_trunc('day', TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40+00', 'Australia/Sydney');
Result: 2001-02-16 08:00:00-05
SELECT date_trunc('hour', INTERVAL '3 days 02:47:33');
Result: 3 days 02:00:00
date_bin
The function date_bin “bins” the input timestamp into the specified interval (the stride) aligned with a specified origin.
date_bin(stride, source, origin)
source is a value expression of type timestamp or timestamp with time zone. (Values of type date are cast automatically to timestamp.) stride is a value expression of type interval. The return value is likewise of type timestamp or timestamp with time zone, and it marks the beginning of the bin into which the source is placed.
Examples:
SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-01-01');
Result: 2020-02-11 15:30:00
SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-01-01 00:02:30');
Result: 2020-02-11 15:32:30
In the case of full units (1 minute, 1 hour, etc.), it gives the same result as the analogous date_trunc call, but the difference is that date_bin can truncate to an arbitrary interval.
The stride interval must be greater than zero and cannot contain units of month or larger.
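For example (a minimal sketch of the full-unit case mentioned above), a one-hour stride with a midnight-aligned origin gives the same result as date_trunc('hour', ...):
SELECT date_bin('1 hour', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-01-01');
Result: 2020-02-11 15:00:00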
AT TIME ZONE
The AT TIME ZONE operator converts time stamp without time zone to/from time stamp with time zone, and converts time with time zone values to different time zones. Table 9.33 shows its variants.
Table 9.33. AT TIME ZONE Variants
timestamp without time zone AT TIME ZONE zone → timestamp with time zone
Converts given time stamp without time zone to time stamp with time zone, assuming the given value is in the named time zone.
timestamp with time zone AT TIME ZONE zone → timestamp without time zone
Converts given time stamp with time zone to time stamp without time zone, as the time would appear in that zone.
time with time zone AT TIME ZONE zone → time with time zone
Converts given time with time zone to a new time zone. Since no date is supplied, this uses the currently active UTC offset for the named destination zone.
In these expressions, the desired time zone zone can be specified either as a text value (e.g., 'America/Los_Angeles') or as an interval (e.g., INTERVAL '-08:00'). In the text case, a time zone name can be specified in any of the ways described in Section 8.5.3. The interval case is only useful for zones that have fixed offsets from UTC, so it is not very common in practice.
Examples (assuming the current TimeZone setting is America/Los_Angeles):
SELECT TIMESTAMP '2001-02-16 20:38:40' AT TIME ZONE 'America/Denver';
Result: 2001-02-16 19:38:40-08
SELECT TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40-05' AT TIME ZONE 'America/Denver';
Result: 2001-02-16 18:38:40
SELECT TIMESTAMP '2001-02-16 20:38:40' AT TIME ZONE 'Asia/Tokyo' AT TIME ZONE 'America/Chicago';
Result: 2001-02-16 05:38:40
The first example adds a time zone to a value that lacks it, and
displays the value using the current TimeZone
setting. The second example shifts the time stamp with time zone value
to the specified time zone, and returns the value without a time zone.
This allows storage and display of values different from the current
TimeZone
setting. The third example converts
Tokyo time to Chicago time.
The function timezone(zone, timestamp) is equivalent to the SQL-conforming construct timestamp AT TIME ZONE zone.
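For example (equivalent to the first AT TIME ZONE example above, under the same TimeZone assumption):
SELECT timezone('America/Denver', TIMESTAMP '2001-02-16 20:38:40');
Result: 2001-02-16 19:38:40-08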
PostgreSQL provides a number of functions that return values related to the current date and time. These SQL-standard functions all return values based on the start time of the current transaction:
CURRENT_DATE
CURRENT_TIME
CURRENT_TIMESTAMP
CURRENT_TIME(precision)
CURRENT_TIMESTAMP(precision)
LOCALTIME
LOCALTIMESTAMP
LOCALTIME(precision)
LOCALTIMESTAMP(precision)
CURRENT_TIME and CURRENT_TIMESTAMP deliver values with time zone; LOCALTIME and LOCALTIMESTAMP deliver values without time zone.
CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP can optionally take a precision parameter, which causes the result to be rounded to that many fractional digits in the seconds field. Without a precision parameter, the result is given to the full available precision.
Some examples:
SELECT CURRENT_TIME;
Result: 14:39:53.662522-05
SELECT CURRENT_DATE;
Result: 2019-12-23
SELECT CURRENT_TIMESTAMP;
Result: 2019-12-23 14:39:53.662522-05
SELECT CURRENT_TIMESTAMP(2);
Result: 2019-12-23 14:39:53.66-05
SELECT LOCALTIMESTAMP;
Result: 2019-12-23 14:39:53.662522
Since these functions return the start time of the current transaction, their values do not change during the transaction. This is considered a feature: the intent is to allow a single transaction to have a consistent notion of the “current” time, so that multiple modifications within the same transaction bear the same time stamp.
Other database systems might advance these values more frequently.
PostgreSQL also provides functions that return the start time of the current statement, as well as the actual current time at the instant the function is called. The complete list of non-SQL-standard time functions is:
transaction_timestamp()
statement_timestamp()
clock_timestamp()
timeofday()
now()
transaction_timestamp() is equivalent to CURRENT_TIMESTAMP, but is named to clearly reflect what it returns.
statement_timestamp() returns the start time of the current statement (more specifically, the time of receipt of the latest command message from the client). statement_timestamp() and transaction_timestamp() return the same value during the first command of a transaction, but might differ during subsequent commands.
clock_timestamp() returns the actual current time, and therefore its value changes even within a single SQL command.
timeofday() is a historical PostgreSQL function. Like clock_timestamp(), it returns the actual current time, but as a formatted text string rather than a timestamp with time zone value.
now() is a traditional PostgreSQL equivalent to transaction_timestamp().
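A minimal illustration of these relationships (each query run as its own statement; the results stated are the expected ones):
SELECT now() = transaction_timestamp();
Result: t
SELECT statement_timestamp() = transaction_timestamp();
Result: t (for the first command of a transaction; may differ later in the transaction)
SELECT clock_timestamp() = clock_timestamp();
Result: f (usually, since each call reads the clock anew)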
All the date/time data types also accept the special literal value now to specify the current date and time (again, interpreted as the transaction start time). Thus, the following three all return the same result:
SELECT CURRENT_TIMESTAMP;
SELECT now();
SELECT TIMESTAMP 'now';  -- but see tip below
Do not use the third form when specifying a value to be evaluated later, for example in a DEFAULT clause for a table column. The system will convert now to a timestamp as soon as the constant is parsed, so that when the default value is needed, the time of the table creation would be used! The first two forms will not be evaluated until the default value is used, because they are function calls. Thus they will give the desired behavior of defaulting to the time of row insertion. (See also Section 8.5.1.4.)
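A minimal sketch of the difference (the table and column names are purely illustrative):
-- created_at is evaluated at row insertion time, as desired
CREATE TABLE events_ok (created_at timestamptz DEFAULT CURRENT_TIMESTAMP);
-- frozen_at is converted to a constant when the table is created, which is almost never what you want
CREATE TABLE events_bad (frozen_at timestamptz DEFAULT TIMESTAMP WITH TIME ZONE 'now');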
The following functions are available to delay execution of the server process:
pg_sleep(double precision)
pg_sleep_for(interval)
pg_sleep_until(timestamp with time zone)
pg_sleep makes the current session's process sleep until the given number of seconds have elapsed. Fractional-second delays can be specified.
pg_sleep_for is a convenience function to allow the sleep time to be specified as an interval.
pg_sleep_until is a convenience function for when a specific wake-up time is desired.
For example:
SELECT pg_sleep(1.5);
SELECT pg_sleep_for('5 minutes');
SELECT pg_sleep_until('tomorrow 03:00');
The effective resolution of the sleep interval is platform-specific; 0.01 seconds is a common value. The sleep delay will be at least as long as specified. It might be longer depending on factors such as server load. In particular, pg_sleep_until is not guaranteed to wake up exactly at the specified time, but it will not wake up any earlier.
Make sure that your session does not hold more locks than necessary when calling pg_sleep or its variants. Otherwise other sessions might have to wait for your sleeping process, slowing down the entire system.
For enum types (described in Section 8.7), there are several functions that allow cleaner programming without hard-coding particular values of an enum type. These are listed in Table 9.34. The examples assume an enum type created as:
CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple');
Table 9.34. Enum Support Functions
Notice that except for the two-argument form of enum_range, these functions disregard the specific value passed to them; they care only about its declared data type. Either null or a specific value of the type can be passed, with the same result. It is more common to apply these functions to a table column or function argument than to a hardwired type name as used in the examples.
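For instance, with the rainbow type created above, one might write (a brief illustration of enum_first, enum_last, and enum_range from Table 9.34):
SELECT enum_first(NULL::rainbow);
Result: red
SELECT enum_last(NULL::rainbow);
Result: purple
SELECT enum_range(NULL::rainbow);
Result: {red,orange,yellow,green,blue,purple}
SELECT enum_range('orange'::rainbow, 'green'::rainbow);
Result: {orange,yellow,green}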
The geometric types point, box, lseg, line, path, polygon, and circle have a large set of native support functions and operators, shown in Table 9.35, Table 9.36, and Table 9.37.
Table 9.35. Geometric Operators
Operator Description Example(s) |
---|
Adds the coordinates of the second
|
Concatenates two open paths (returns NULL if either path is closed).
|
Subtracts the coordinates of the second
|
Multiplies each point of the first argument by the second
|
Divides each point of the first argument by the second
|
Computes the total length.
Available for
|
Computes the center point.
Available for
|
Returns the number of points.
Available for
|
Computes the point of intersection, or NULL if there is none.
Available for
|
Computes the intersection of two boxes, or NULL if there is none.
|
Computes the closest point to the first object on the second object.
Available for these pairs of types:
(
|
Computes the distance between the objects.
Available for all geometric types except
|
Does first object contain second?
Available for these pairs of types:
(
|
Is first object contained in or on second?
Available for these pairs of types:
(
|
Do these objects overlap? (One point in common makes this true.)
Available for
|
Is first object strictly left of second?
Available for
|
Is first object strictly right of second?
Available for
|
Does first object not extend to the right of second?
Available for
|
Does first object not extend to the left of second?
Available for
|
Is first object strictly below second?
Available for
|
Is first object strictly above second?
Available for
|
Does first object not extend above second?
Available for
|
Does first object not extend below second?
Available for
|
Is first object below second (allows edges to touch)?
|
Is first object above second (allows edges to touch)?
|
Do these objects intersect?
Available for these pairs of types:
(
|
Is line horizontal?
|
Are points horizontally aligned (that is, have same y coordinate)?
|
Is line vertical?
|
Are points vertically aligned (that is, have same x coordinate)?
|
Are lines perpendicular?
|
Are lines parallel?
|
Are these objects the same?
Available for
|
[a] “Rotating” a box with these operators only moves its corner points: the box is still considered to have sides parallel to the axes. Hence the box's size is not preserved, as a true rotation would do. |
Note that the “same as” operator, ~=, represents the usual notion of equality for the point, box, polygon, and circle types. Some of the geometric types also have an = operator, but = compares for equal areas only. The other scalar comparison operators (<= and so on), where available for these types, likewise compare areas.
Before PostgreSQL 14, the point is strictly below/above comparison operators point <<| point and point |>> point were respectively called <^ and >^. These names are still available, but are deprecated and will eventually be removed.
Table 9.36. Geometric Functions
Table 9.37. Geometric Type Conversion Functions
It is possible to access the two component numbers of a point as though the point were an array with indexes 0 and 1. For example, if t.p is a point column then SELECT p[0] FROM t retrieves the X coordinate and UPDATE t SET p[1] = ... changes the Y coordinate. In the same way, a value of type box or lseg can be treated as an array of two point values.
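For example (a sketch assuming a hypothetical table boxes with a box column b), the two corner points can be read with subscripts:
SELECT b[0] AS corner_1, b[1] AS corner_2 FROM boxes;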
The IP network address types, cidr and inet, support the usual comparison operators shown in Table 9.1 as well as the specialized operators and functions shown in Table 9.38 and Table 9.39.
Any cidr value can be cast to inet implicitly; therefore, the operators and functions shown below as operating on inet also work on cidr values. (Where there are separate functions for inet and cidr, it is because the behavior should be different for the two cases.) Also, it is permitted to cast an inet value to cidr. When this is done, any bits to the right of the netmask are silently zeroed to create a valid cidr value.
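For example (a minimal illustration of the cast just described):
SELECT (inet '192.168.1.5/24')::cidr;
Result: 192.168.1.0/24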
Table 9.38. IP Address Operators
Operator Description Example(s) |
---|
Is subnet strictly contained by subnet? This operator, and the next four, test for subnet inclusion. They consider only the network parts of the two addresses (ignoring any bits to the right of the netmasks) and determine whether one network is identical to or a subnet of the other.
|
Is subnet contained by or equal to subnet?
|
Does subnet strictly contain subnet?
|
Does subnet contain or equal subnet?
|
Does either subnet contain or equal the other?
|
Computes bitwise NOT.
|
Computes bitwise AND.
|
Computes bitwise OR.
|
Adds an offset to an address.
|
Adds an offset to an address.
|
Subtracts an offset from an address.
|
Computes the difference of two addresses.
|
Table 9.39. IP Address Functions
Function Description Example(s) |
---|
Creates an abbreviated display format as text.
(The result is the same as the
|
Creates an abbreviated display format as text. (The abbreviation consists of dropping all-zero octets to the right of the netmask; more examples are in Table 8.22.)
|
Computes the broadcast address for the address's network.
|
Returns the address's family:
|
Returns the IP address as text, ignoring the netmask.
|
Computes the host mask for the address's network.
|
Computes the smallest network that includes both of the given networks.
|
Tests whether the addresses belong to the same IP family.
|
Returns the netmask length in bits.
|
Computes the network mask for the address's network.
|
Returns the network part of the address, zeroing out
whatever is to the right of the netmask.
(This is equivalent to casting the value to
|
Sets the netmask length for an
|
Sets the netmask length for a
|
Returns the unabbreviated IP address and netmask length as text.
(This has the same result as an explicit cast to
|
The abbrev, host, and text functions are primarily intended to offer alternative display formats for IP addresses.
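For example (results follow from the functions' definitions in Table 9.39):
SELECT host(inet '192.168.1.5/24');
Result: 192.168.1.5
SELECT text(inet '192.168.1.5/24');
Result: 192.168.1.5/24
SELECT abbrev(cidr '10.1.0.0/16');
Result: 10.1/16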
The MAC address types, macaddr and macaddr8, support the usual comparison operators shown in Table 9.1 as well as the specialized functions shown in Table 9.40. In addition, they support the bitwise logical operators ~, & and | (NOT, AND and OR), just as shown above for IP addresses.
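For example (trunc is one of the functions listed in Table 9.40; it zeroes the last three bytes of a macaddr, and the bitwise AND shown here has the same effect):
SELECT trunc(macaddr '12:34:56:78:90:ab');
Result: 12:34:56:00:00:00
SELECT macaddr '12:34:56:78:90:ab' & macaddr 'ff:ff:ff:00:00:00';
Result: 12:34:56:00:00:00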
Table 9.40. MAC Address Functions
Table 9.41, Table 9.42 and Table 9.43 summarize the functions and operators that are provided for full text searching. See Chapter 12 for a detailed explanation of PostgreSQL's text search facility.
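As a quick illustration of the match operator @@ listed in Table 9.41 (a minimal example; see Chapter 12 for the full facility):
SELECT to_tsvector('english', 'a fat cat sat on a mat') @@ to_tsquery('english', 'cat & mat');
Result: t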
Table 9.41. Text Search Operators
Operator Description Example(s) |
---|
Does
|
Does text string, after implicit invocation
of
|
This is a deprecated synonym for
|
Concatenates two
|
ANDs two
|
ORs two
|
Negates a
|
Constructs a phrase query, which matches if the two input queries match at successive lexemes.
|
Does first
|
Is first
|
In addition to these specialized operators, the usual comparison operators shown in Table 9.1 are available for types tsvector and tsquery. These are not very useful for text searching but allow, for example, unique indexes to be built on columns of these types.
Table 9.42. Text Search Functions
Function Description Example(s) |
---|
Converts an array of lexemes to a
|
Returns the OID of the current default text search configuration (as set by default_text_search_config).
|
Returns the number of lexemes in the
|
Returns the number of lexemes plus operators in
the
|
Converts text to a
|
Converts text to a
|
Converts text to a
|
Produces a representation of the indexable portion of
a
|
Assigns the specified
|
Assigns the specified
|
Removes positions and weights from the
|
Converts text to a
|
Converts text to a
|
Converts each string value in the JSON document to
a
|
Selects each item in the JSON document that is requested by
the
|
Removes any occurrence of the given
|
Removes any occurrences of the lexemes
in
|
Selects only elements with the given
|
Displays, in an abbreviated form, the match(es) for
the
|
Displays, in an abbreviated form, match(es) for
the
|
Computes a score showing how well
the
|
Computes a score showing how well
the
|
Replaces occurrences of
|
Replaces portions of the
|
Constructs a phrase query that searches
for matches of
|
Constructs a phrase query that searches
for matches of
|
Converts a
|
Expands a
lexeme | positions | weights --------+-----------+--------- cat | {3} | {D} fat | {2,4} | {D,D} rat | {5} | {A}
|
All the text search functions that accept an optional regconfig argument will use the configuration specified by default_text_search_config when that argument is omitted.
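For instance (a sketch assuming the english configuration, which is available in a default installation):
SET default_text_search_config = 'english';
SELECT to_tsvector('The quick brown foxes');
Result: 'brown':3 'fox':4 'quick':2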
The functions in Table 9.43 are listed separately because they are not usually used in everyday text searching operations. They are primarily helpful for development and debugging of new text search configurations.
Table 9.43. Text Search Debugging Functions
Function Description Example(s) |
---|
Extracts and normalizes tokens from
the
|
Returns an array of replacement lexemes if the input token is known to the dictionary, or an empty array if the token is known to the dictionary but it is a stop word, or NULL if it is not a known word. See Section 12.8.3 for details.
|
Extracts tokens from the
|
Extracts tokens from the
|
Returns a table that describes each type of token the named parser can recognize. See Section 12.8.2 for details.
|
Returns a table that describes each type of token a parser specified by OID can recognize. See Section 12.8.2 for details.
|
Executes the
|
PostgreSQL includes one function to generate a UUID:
gen_random_uuid() → uuid
This function returns a version 4 (random) UUID. This is the most commonly used type of UUID and is appropriate for most applications.
The uuid-ossp module provides additional functions that implement other standard algorithms for generating UUIDs.
PostgreSQL also provides the usual comparison operators shown in Table 9.1 for UUIDs.
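For example:
SELECT gen_random_uuid();
Result: a randomly generated version 4 UUID, different on each call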
The functions and function-like expressions described in this section operate on values of type xml. See Section 8.13 for information about the xml type. The function-like expressions xmlparse and xmlserialize for converting to and from type xml are documented there, not in this section.
Use of most of these functions requires PostgreSQL to have been built with configure --with-libxml.
A set of functions and function-like expressions is available for producing XML content from SQL data. As such, they are particularly suitable for formatting query results into XML documents for processing in client applications.
xmlcomment
xmlcomment(text) → xml
The function xmlcomment creates an XML value containing an XML comment with the specified text as content. The text cannot contain “--” or end with a “-”, otherwise the resulting construct would not be a valid XML comment. If the argument is null, the result is null.
Example:
SELECT xmlcomment('hello'); xmlcomment -------------- <!--hello-->
xmlconcat
xmlconcat(xml [, ...]) → xml
The function xmlconcat concatenates a list of individual XML values to create a single value containing an XML content fragment. Null values are omitted; the result is only null if there are no nonnull arguments.
Example:
SELECT xmlconcat('<abc/>', '<bar>foo</bar>'); xmlconcat ---------------------- <abc/><bar>foo</bar>
XML declarations, if present, are combined as follows. If all argument values have the same XML version declaration, that version is used in the result, else no version is used. If all argument values have the standalone declaration value “yes”, then that value is used in the result. If all argument values have a standalone declaration value and at least one is “no”, then that is used in the result. Else the result will have no standalone declaration. If the result is determined to require a standalone declaration but no version declaration, a version declaration with version 1.0 will be used because XML requires an XML declaration to contain a version declaration. Encoding declarations are ignored and removed in all cases.
Example:
SELECT xmlconcat('<?xml version="1.1"?><foo/>', '<?xml version="1.1" standalone="no"?><bar/>'); xmlconcat ----------------------------------- <?xml version="1.1"?><foo/><bar/>
xmlelement
xmlelement(NAME name [, XMLATTRIBUTES (attvalue [AS attname] [, ...])] [, content [, ...]]) → xml
The xmlelement expression produces an XML element with the given name, attributes, and content.
The name and attname items shown in the syntax are simple identifiers, not values. The attvalue and content items are expressions, which can yield any PostgreSQL data type. The argument(s) within XMLATTRIBUTES generate attributes of the XML element; the content value(s) are concatenated to form its content.
Examples:
SELECT xmlelement(name foo); xmlelement ------------ <foo/> SELECT xmlelement(name foo, xmlattributes('xyz' as bar)); xmlelement ------------------ <foo bar="xyz"/> SELECT xmlelement(name foo, xmlattributes(current_date as bar), 'cont', 'ent'); xmlelement ------------------------------------- <foo bar="2007-01-26">content</foo>
Element and attribute names that are not valid XML names are escaped by replacing the offending characters by the sequence _xHHHH_, where HHHH is the character's Unicode codepoint in hexadecimal notation. For example:
SELECT xmlelement(name "foo$bar", xmlattributes('xyz' as "a&b")); xmlelement ---------------------------------- <foo_x0024_bar a_x0026_b="xyz"/>
An explicit attribute name need not be specified if the attribute value is a column reference, in which case the column's name will be used as the attribute name by default. In other cases, the attribute must be given an explicit name. So this example is valid:
CREATE TABLE test (a xml, b xml); SELECT xmlelement(name test, xmlattributes(a, b)) FROM test;
But these are not:
SELECT xmlelement(name test, xmlattributes('constant'), a, b) FROM test; SELECT xmlelement(name test, xmlattributes(func(a, b))) FROM test;
Element content, if specified, will be formatted according to its data type. If the content is itself of type xml, complex XML documents can be constructed. For example:
SELECT xmlelement(name foo, xmlattributes('xyz' as bar), xmlelement(name abc), xmlcomment('test'), xmlelement(name xyz)); xmlelement ---------------------------------------------- <foo bar="xyz"><abc/><!--test--><xyz/></foo>
Content of other types will be formatted into valid XML character data. This means in particular that the characters <, >, and & will be converted to entities. Binary data (data type bytea) will be represented in base64 or hex encoding, depending on the setting of the configuration parameter xmlbinary. The particular behavior for individual data types is expected to evolve in order to align the PostgreSQL mappings with those specified in SQL:2006 and later, as discussed in Section D.3.1.3.
xmlforest
xmlforest(content [AS name] [, ...]) → xml
The xmlforest expression produces an XML forest (sequence) of elements using the given names and content. As for xmlelement, each name must be a simple identifier, while the content expressions can have any data type.
Examples:
SELECT xmlforest('abc' AS foo, 123 AS bar); xmlforest ------------------------------ <foo>abc</foo><bar>123</bar> SELECT xmlforest(table_name, column_name) FROM information_schema.columns WHERE table_schema = 'pg_catalog'; xmlforest ----------------------------------------------------------------------- <table_name>pg_authid</table_name><column_name>rolname</column_name> <table_name>pg_authid</table_name><column_name>rolsuper</column_name> ...
As seen in the second example, the element name can be omitted if the content value is a column reference, in which case the column name is used by default. Otherwise, a name must be specified.
Element names that are not valid XML names are escaped as shown for xmlelement above. Similarly, content data is escaped to make valid XML content, unless it is already of type xml.
Note that XML forests are not valid XML documents if they consist of more than one element, so it might be useful to wrap xmlforest expressions in xmlelement.
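For example (a minimal sketch of the wrapping technique just described; the element name summary is illustrative):
SELECT xmlelement(name summary, xmlforest('abc' AS foo, 123 AS bar));
Result: <summary><foo>abc</foo><bar>123</bar></summary>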
xmlpi
xmlpi(NAME name [, content]) → xml
The xmlpi expression creates an XML processing instruction. As for xmlelement, the name must be a simple identifier, while the content expression can have any data type. The content, if present, must not contain the character sequence ?>.
Example:
SELECT xmlpi(name php, 'echo "hello world";'); xmlpi ----------------------------- <?php echo "hello world";?>
xmlroot
xmlroot(xml, VERSION {text | NO VALUE} [, STANDALONE {YES | NO | NO VALUE}]) → xml
The xmlroot expression alters the properties of the root node of an XML value. If a version is specified, it replaces the value in the root node's version declaration; if a standalone setting is specified, it replaces the value in the root node's standalone declaration.
SELECT xmlroot(xmlparse(document '<?xml version="1.1"?><content>abc</content>'), version '1.0', standalone yes); xmlroot ---------------------------------------- <?xml version="1.0" standalone="yes"?> <content>abc</content>
xmlagg
xmlagg(xml) → xml
The function xmlagg is, unlike the other functions described here, an aggregate function. It concatenates the input values to the aggregate function call, much like xmlconcat does, except that concatenation occurs across rows rather than across expressions in a single row. See Section 9.21 for additional information about aggregate functions.
Example:
CREATE TABLE test (y int, x xml); INSERT INTO test VALUES (1, '<foo>abc</foo>'); INSERT INTO test VALUES (2, '<bar/>'); SELECT xmlagg(x) FROM test; xmlagg ---------------------- <foo>abc</foo><bar/>
To determine the order of the concatenation, an ORDER BY clause may be added to the aggregate call as described in Section 4.2.7. For example:
SELECT xmlagg(x ORDER BY y DESC) FROM test; xmlagg ---------------------- <bar/><foo>abc</foo>
The following non-standard approach used to be recommended in previous versions, and may still be useful in specific cases:
SELECT xmlagg(x) FROM (SELECT * FROM test ORDER BY y DESC) AS tab; xmlagg ---------------------- <bar/><foo>abc</foo>
The expressions described in this section check properties of xml values.
IS DOCUMENT
xml IS DOCUMENT → boolean
The expression IS DOCUMENT returns true if the argument XML value is a proper XML document, false if it is not (that is, it is a content fragment), or null if the argument is null. See Section 8.13 about the difference between documents and content fragments.
IS NOT DOCUMENT
xml IS NOT DOCUMENT → boolean
The expression IS NOT DOCUMENT returns false if the argument XML value is a proper XML document, true if it is not (that is, it is a content fragment), or null if the argument is null.
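For example (a brief illustration using xmlparse, which is documented in Section 8.13):
SELECT xmlparse(document '<foo>bar</foo>') IS DOCUMENT;
Result: t
SELECT xmlparse(content 'abc<foo/>') IS DOCUMENT;
Result: f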
XMLEXISTS
XMLEXISTS(text PASSING [BY {REF | VALUE}] xml [BY {REF | VALUE}]) → boolean
The function xmlexists evaluates an XPath 1.0 expression (the first argument), with the passed XML value as its context item. The function returns false if the result of that evaluation yields an empty node-set, true if it yields any other value. The function returns null if any argument is null. A nonnull value passed as the context item must be an XML document, not a content fragment or any non-XML value.
Example:
SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY VALUE '<towns><town>Toronto</town><town>Ottawa</town></towns>'); xmlexists ------------ t (1 row)
The BY REF and BY VALUE clauses are accepted in PostgreSQL, but are ignored, as discussed in Section D.3.2.
In the SQL standard, the xmlexists function evaluates an expression in the XML Query language, but PostgreSQL allows only an XPath 1.0 expression, as discussed in Section D.3.1.
xml_is_well_formed
xml_is_well_formed(text) → boolean
xml_is_well_formed_document(text) → boolean
xml_is_well_formed_content(text) → boolean
These functions check whether a text string represents well-formed XML, returning a Boolean result. xml_is_well_formed_document checks for a well-formed document, while xml_is_well_formed_content checks for well-formed content. xml_is_well_formed does the former if the xmloption configuration parameter is set to DOCUMENT, or the latter if it is set to CONTENT. This means that xml_is_well_formed is useful for seeing whether a simple cast to type xml will succeed, whereas the other two functions are useful for seeing whether the corresponding variants of XMLPARSE will succeed.
Examples:
SET xmloption TO DOCUMENT; SELECT xml_is_well_formed('<>'); xml_is_well_formed -------------------- f (1 row) SELECT xml_is_well_formed('<abc/>'); xml_is_well_formed -------------------- t (1 row) SET xmloption TO CONTENT; SELECT xml_is_well_formed('abc'); xml_is_well_formed -------------------- t (1 row) SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>'); xml_is_well_formed_document ----------------------------- t (1 row) SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>'); xml_is_well_formed_document ----------------------------- f (1 row)
The last example shows that the checks include whether namespaces are correctly matched.
To process values of data type xml, PostgreSQL offers the functions xpath and xpath_exists, which evaluate XPath 1.0 expressions, and the XMLTABLE table function.
xpath
xpath(xpath text, xml xml [, nsarray text[]]) → xml[]
The function xpath evaluates the XPath 1.0 expression xpath (given as text) against the XML value xml. It returns an array of XML values corresponding to the node-set produced by the XPath expression. If the XPath expression returns a scalar value rather than a node-set, a single-element array is returned.
The second argument must be a well-formed XML document. In particular, it must have a single root node element.
The optional third argument of the function is an array of namespace mappings. This array should be a two-dimensional text array with the length of the second axis being equal to 2 (i.e., it should be an array of arrays, each of which consists of exactly 2 elements). The first element of each array entry is the namespace name (alias), the second the namespace URI. It is not required that aliases provided in this array be the same as those being used in the XML document itself (in other words, both in the XML document and in the xpath function context, aliases are local).
Example:
SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>', ARRAY[ARRAY['my', 'http://example.com']]); xpath -------- {test} (1 row)
To deal with default (anonymous) namespaces, do something like this:
SELECT xpath('//mydefns:b/text()', '<a xmlns="http://example.com"><b>test</b></a>', ARRAY[ARRAY['mydefns', 'http://example.com']]); xpath -------- {test} (1 row)
xpath_exists
xpath_exists(xpath text, xml xml [, nsarray text[]]) → boolean
The function xpath_exists is a specialized form of the xpath function. Instead of returning the individual XML values that satisfy the XPath 1.0 expression, this function returns a Boolean indicating whether the query was satisfied or not (specifically, whether it produced any value other than an empty node-set). This function is equivalent to the XMLEXISTS predicate, except that it also offers support for a namespace mapping argument.
Example:
SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>', ARRAY[ARRAY['my', 'http://example.com']]); xpath_exists -------------- t (1 row)
xmltable
XMLTABLE( [XMLNAMESPACES(namespace_uri AS namespace_name [, ...]), ] row_expression PASSING [BY {REF | VALUE}] document_expression [BY {REF | VALUE}] COLUMNS name { type [PATH column_expression] [DEFAULT default_expression] [NOT NULL | NULL] | FOR ORDINALITY } [, ...] ) → setof record
The xmltable expression produces a table based on an XML value, an XPath filter to extract rows, and a set of column definitions. Although it syntactically resembles a function, it can only appear as a table in a query's FROM clause.
The optional XMLNAMESPACES clause gives a comma-separated list of namespace definitions, where each namespace_uri is a text expression and each namespace_name is a simple identifier. It specifies the XML namespaces used in the document and their aliases. A default namespace specification is not currently supported.
The required row_expression argument is an XPath 1.0 expression (given as text) that is evaluated, passing the XML value document_expression as its context item, to obtain a set of XML nodes. These nodes are what xmltable transforms into output rows. No rows will be produced if the document_expression is null, nor if the row_expression produces an empty node-set or any value other than a node-set.
document_expression provides the context item for the row_expression. It must be a well-formed XML document; fragments/forests are not accepted.
The BY REF and BY VALUE clauses are accepted but ignored, as discussed in Section D.3.2.
In the SQL standard, the xmltable function evaluates expressions in the XML Query language, but PostgreSQL allows only XPath 1.0 expressions, as discussed in Section D.3.1.
The required COLUMNS clause specifies the column(s) that will be produced in the output table. See the syntax summary above for the format. A name is required for each column, as is a data type (unless FOR ORDINALITY is specified, in which case type integer is implicit). The path, default and nullability clauses are optional.
A column marked FOR ORDINALITY will be populated with row numbers, starting with 1, in the order of nodes retrieved from the row_expression's result node-set. At most one column may be marked FOR ORDINALITY.
XPath 1.0 does not specify an order for nodes in a node-set, so code that relies on a particular order of the results will be implementation-dependent. Details can be found in Section D.3.1.2.
The column_expression for a column is an XPath 1.0 expression that is evaluated for each row, with the current node from the row_expression result as its context item, to find the value of the column. If no column_expression is given, then the column name is used as an implicit path.
If a column's XPath expression returns a non-XML value (which is limited to string, boolean, or double in XPath 1.0) and the column has a PostgreSQL type other than xml, the column will be set as if by assigning the value's string representation to the PostgreSQL type. (If the value is a boolean, its string representation is taken to be 1 or 0 if the output column's type category is numeric, otherwise true or false.)
If a column's XPath expression returns a non-empty set of XML nodes and the column's PostgreSQL type is xml, the column will be assigned the expression result exactly, if it is of document or content form. [8]
A non-XML result assigned to an xml output column produces content, a single text node with the string value of the result. An XML result assigned to a column of any other type may not have more than one node, or an error is raised. If there is exactly one node, the column will be set as if by assigning the node's string value (as defined for the XPath 1.0 string function) to the PostgreSQL type.
The string value of an XML element is the concatenation, in document order, of all text nodes contained in that element and its descendants. The string value of an element with no descendant text nodes is an empty string (not NULL). Any xsi:nil attributes are ignored. Note that the whitespace-only text() node between two non-text elements is preserved, and that leading whitespace on a text() node is not flattened. The XPath 1.0 string function may be consulted for the rules defining the string value of other XML node types and non-XML values.
The conversion rules presented here are not exactly those of the SQL standard, as discussed in Section D.3.1.3.
If the path expression returns an empty node-set (typically, when it does not match) for a given row, the column will be set to NULL, unless a default_expression is specified; then the value resulting from evaluating that expression is used.
A default_expression, rather than being evaluated immediately when xmltable is called, is evaluated each time a default is needed for the column. If the expression qualifies as stable or immutable, the repeat evaluation may be skipped. This means that you can usefully use volatile functions like nextval in default_expression.
Columns may be marked NOT NULL. If the column_expression for a NOT NULL column does not match anything and there is no DEFAULT or the default_expression also evaluates to null, an error is reported.
Examples:
CREATE TABLE xmldata AS SELECT xml $$ <ROWS> <ROW id="1"> <COUNTRY_ID>AU</COUNTRY_ID> <COUNTRY_NAME>Australia</COUNTRY_NAME> </ROW> <ROW id="5"> <COUNTRY_ID>JP</COUNTRY_ID> <COUNTRY_NAME>Japan</COUNTRY_NAME> <PREMIER_NAME>Shinzo Abe</PREMIER_NAME> <SIZE unit="sq_mi">145935</SIZE> </ROW> <ROW id="6"> <COUNTRY_ID>SG</COUNTRY_ID> <COUNTRY_NAME>Singapore</COUNTRY_NAME> <SIZE unit="sq_km">697</SIZE> </ROW> </ROWS> $$ AS data; SELECT xmltable.* FROM xmldata, XMLTABLE('//ROWS/ROW' PASSING data COLUMNS id int PATH '@id', ordinality FOR ORDINALITY, "COUNTRY_NAME" text, country_id text PATH 'COUNTRY_ID', size_sq_km float PATH 'SIZE[@unit = "sq_km"]', size_other text PATH 'concat(SIZE[@unit!="sq_km"], " ", SIZE[@unit!="sq_km"]/@unit)', premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified'); id | ordinality | COUNTRY_NAME | country_id | size_sq_km | size_other | premier_name ----+------------+--------------+------------+------------+--------------+--------------- 1 | 1 | Australia | AU | | | not specified 5 | 2 | Japan | JP | | 145935 sq_mi | Shinzo Abe 6 | 3 | Singapore | SG | 697 | | not specified
The following example shows concatenation of multiple text() nodes, usage of the column name as XPath filter, and the treatment of whitespace, XML comments and processing instructions:
CREATE TABLE xmlelements AS SELECT xml $$ <root> <element> Hello<!-- xyxxz -->2a2<?aaaaa?> <!--x--> bbb<x>xxx</x>CC </element> </root> $$ AS data; SELECT xmltable.* FROM xmlelements, XMLTABLE('/root' PASSING data COLUMNS element text); element ------------------------- Hello2a2 bbbxxxCC
The following example illustrates how the XMLNAMESPACES clause can be used to specify a list of namespaces used in the XML document as well as in the XPath expressions:
WITH xmldata(data) AS (VALUES (' <example xmlns="http://example.com/myns" xmlns:B="http://example.com/b"> <item foo="1" B:bar="2"/> <item foo="3" B:bar="4"/> <item foo="4" B:bar="5"/> </example>'::xml) ) SELECT xmltable.* FROM XMLTABLE(XMLNAMESPACES('http://example.com/myns' AS x, 'http://example.com/b' AS "B"), '/x:example/x:item' PASSING (SELECT data FROM xmldata) COLUMNS foo int PATH '@foo', bar int PATH '@B:bar'); foo | bar -----+----- 1 | 2 3 | 4 4 | 5 (3 rows)
The following functions map the contents of relational tables to XML values. They can be thought of as XML export functionality:
table_to_xml(table regclass, nulls boolean, tableforest boolean, targetns text) → xml
query_to_xml(query text, nulls boolean, tableforest boolean, targetns text) → xml
cursor_to_xml(cursor refcursor, count integer, nulls boolean, tableforest boolean, targetns text) → xml
table_to_xml maps the content of the named table, passed as parameter table. The regclass type accepts strings identifying tables using the usual notation, including optional schema qualification and double quotes (see Section 8.19 for details). query_to_xml executes the query whose text is passed as parameter query and maps the result set. cursor_to_xml fetches the indicated number of rows from the cursor specified by the parameter cursor. This variant is recommended if large tables have to be mapped, because the result value is built up in memory by each function.
If tableforest is false, then the resulting XML document looks like this:
<tablename>
  <row>
    <columnname1>data</columnname1>
    <columnname2>data</columnname2>
  </row>
  <row>
    ...
  </row>
  ...
</tablename>
If tableforest is true, the result is an XML content fragment that looks like this:
<tablename>
  <columnname1>data</columnname1>
  <columnname2>data</columnname2>
</tablename>
<tablename>
  ...
</tablename>
...
If no table name is available, that is, when mapping a query or a cursor, the string table is used in the first format, row in the second format.
The choice between these formats is up to the user. The first format is a proper XML document, which will be important in many applications. The second format tends to be more useful in the cursor_to_xml function if the result values are to be reassembled into one document later on. The functions for producing XML content discussed above, in particular xmlelement, can be used to alter the results to taste.
The data values are mapped in the same way as described for the function xmlelement above.
The parameter nulls determines whether null values should be included in the output. If true, null values in columns are represented as:
<columnname xsi:nil="true"/>
where xsi is the XML namespace prefix for XML Schema Instance. An appropriate namespace declaration will be added to the result value. If false, columns containing null values are simply omitted from the output.
The parameter targetns specifies the desired XML namespace of the result. If no particular namespace is wanted, an empty string should be passed.
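For example (a sketch; the table name inventory is hypothetical and the exact output depends on that table's contents):
SELECT table_to_xml('inventory', true, false, '');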
The following functions return XML Schema documents describing the mappings performed by the corresponding functions above:
table_to_xmlschema(table regclass, nulls boolean, tableforest boolean, targetns text) → xml
query_to_xmlschema(query text, nulls boolean, tableforest boolean, targetns text) → xml
cursor_to_xmlschema(cursor refcursor, nulls boolean, tableforest boolean, targetns text) → xml
It is essential that the same parameters are passed in order to obtain matching XML data mappings and XML Schema documents.
The following functions produce XML data mappings and the corresponding XML Schema in one document (or forest), linked together. They can be useful where self-contained and self-describing results are wanted:
table_to_xml_and_xmlschema(table regclass, nulls boolean, tableforest boolean, targetns text) → xml
query_to_xml_and_xmlschema(query text, nulls boolean, tableforest boolean, targetns text) → xml
In addition, the following functions are available to produce analogous mappings of entire schemas or the entire current database:
schema_to_xml(schema name, nulls boolean, tableforest boolean, targetns text) → xml
schema_to_xmlschema(schema name, nulls boolean, tableforest boolean, targetns text) → xml
schema_to_xml_and_xmlschema(schema name, nulls boolean, tableforest boolean, targetns text) → xml
database_to_xml(nulls boolean, tableforest boolean, targetns text) → xml
database_to_xmlschema(nulls boolean, tableforest boolean, targetns text) → xml
database_to_xml_and_xmlschema(nulls boolean, tableforest boolean, targetns text) → xml
These functions ignore tables that are not readable by the current user. The database-wide functions additionally ignore schemas that the current user does not have USAGE (lookup) privilege for.
Note that these potentially produce a lot of data, which needs to be built up in memory. When requesting content mappings of large schemas or databases, it might be worthwhile to consider mapping the tables separately instead, possibly even through a cursor.
The result of a schema content mapping looks like this:
<schemaname>
  table1-mapping
  table2-mapping
  ...
</schemaname>
where the format of a table mapping depends on the tableforest parameter as explained above.
The result of a database content mapping looks like this:
<dbname>
  <schema1name>
    ...
  </schema1name>
  <schema2name>
    ...
  </schema2name>
  ...
</dbname>
where the schema mapping is as above.
As an example of using the output produced by these functions, Example 9.1 shows an XSLT stylesheet that converts the output of table_to_xml_and_xmlschema to an HTML document containing a tabular rendition of the table data. In a similar manner, the results from these functions can be converted into other XML-based formats.
Example 9.1. XSLT Stylesheet for Converting SQL/XML Output to HTML
<?xml version="1.0"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3.org/1999/xhtml" > <xsl:output method="xml" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" doctype-public="-//W3C/DTD XHTML 1.0 Strict//EN" indent="yes"/> <xsl:template match="/*"> <xsl:variable name="schema" select="//xsd:schema"/> <xsl:variable name="tabletypename" select="$schema/xsd:element[@name=name(current())]/@type"/> <xsl:variable name="rowtypename" select="$schema/xsd:complexType[@name=$tabletypename]/xsd:sequence/xsd:element[@name='row']/@type"/> <html> <head> <title><xsl:value-of select="name(current())"/></title> </head> <body> <table> <tr> <xsl:for-each select="$schema/xsd:complexType[@name=$rowtypename]/xsd:sequence/xsd:element/@name"> <th><xsl:value-of select="."/></th> </xsl:for-each> </tr> <xsl:for-each select="row"> <tr> <xsl:for-each select="*"> <td><xsl:value-of select="."/></td> </xsl:for-each> </tr> </xsl:for-each> </table> </body> </html> </xsl:template> </xsl:stylesheet>
This section describes:
functions and operators for processing and creating JSON data
the SQL/JSON path language
To learn more about the SQL/JSON standard, see [sqltr-19075-6]. For details on JSON types supported in PostgreSQL, see Section 8.14.
Table 9.44 shows the operators that are available for use with JSON data types (see Section 8.14). In addition, the usual comparison operators shown in Table 9.1 are available for jsonb, though not for json. The comparison operators follow the ordering rules for B-tree operations outlined in Section 8.14.4. See also Section 9.21 for the aggregate function json_agg which aggregates record values as JSON, the aggregate function json_object_agg which aggregates pairs of values into a JSON object, and their jsonb equivalents, jsonb_agg and jsonb_object_agg.
Table 9.44. json and jsonb Operators
Operator Description Example(s) |
---|
Extracts
|
Extracts JSON object field with the given key.
|
Extracts
|
Extracts JSON object field with the given key, as
|
Extracts JSON sub-object at the specified path, where path elements can be either field keys or array indexes.
|
Extracts JSON sub-object at the specified path as
|
The field/element/path extraction operators return NULL, rather than failing, if the JSON input does not have the right structure to match the request; for example if no such key or array element exists.
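For example (a brief illustration of the ->, #>> and missing-key behavior described in Table 9.44):
SELECT '{"a": {"b": 1}}'::jsonb -> 'a' -> 'b';
Result: 1
SELECT '{"a": {"b": 1}}'::jsonb #>> '{a,b}';
Result: 1
SELECT '{"a": {"b": 1}}'::jsonb -> 'missing';
Result: NULL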
Some further operators exist only for jsonb, as shown in Table 9.45. Section 8.14.4 describes how these operators can be used to effectively search indexed jsonb data.
Table 9.45. Additional jsonb Operators
Operator Description Example(s) |
---|
Does the first JSON value contain the second? (See Section 8.14.3 for details about containment.)
|
Is the first JSON value contained in the second?
|
Does the text string exist as a top-level key or array element within the JSON value?
|
Do any of the strings in the text array exist as top-level keys or array elements?
|
Do all of the strings in the text array exist as top-level keys or array elements?
|
Concatenates two
To append an array to another array as a single entry, wrap it in an additional layer of array, for example:
|
Deletes a key (and its value) from a JSON object, or matching string value(s) from a JSON array.
|
Deletes all matching keys or array elements from the left operand.
|
Deletes the array element with specified index (negative integers count from the end). Throws an error if JSON value is not an array.
|
Deletes the field or array element at the specified path, where path elements can be either field keys or array indexes.
|
Does JSON path return any item for the specified JSON value?
|
Returns the result of a JSON path predicate check for the
specified JSON value. Only the first item of the result is taken into
account. If the result is not Boolean, then
|
The jsonpath operators @? and @@ suppress the following errors: missing object field or array element, unexpected JSON item type, datetime and numeric errors. The jsonpath-related functions described below can also be told to suppress these types of errors. This behavior might be helpful when searching JSON document collections of varying structure.
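For example (a minimal illustration of the two jsonpath operators):
SELECT '{"a": [1, 2, 3]}'::jsonb @? '$.a[*] ? (@ > 2)';
Result: t
SELECT '{"a": [1, 2, 3]}'::jsonb @@ '$.a[*] > 2';
Result: t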
Table 9.46 shows the functions that are available for constructing json and jsonb values.
Table 9.46. JSON Creation Functions
Function Description Example(s) |
---|
Converts any SQL value to
|
Converts an SQL array to a JSON array. The behavior is the same
as
|
Converts an SQL composite value to a JSON object. The behavior is the
same as
|
Builds a possibly-heterogeneously-typed JSON array out of a variadic
argument list. Each argument is converted as
per
|
Builds a JSON object out of a variadic argument list. By convention,
the argument list consists of alternating keys and values. Key
arguments are coerced to text; value arguments are converted as
per
|
Builds a JSON object out of a text array. The array must have either exactly one dimension with an even number of members, in which case they are taken as alternating key/value pairs, or two dimensions such that each inner array has exactly two elements, which are taken as a key/value pair. All values are converted to JSON strings.
|
This form of
|
Table 9.47 shows the functions that are available for processing json and jsonb values.
Table 9.47. JSON Processing Functions
Function Description Example(s) |
---|
Expands the top-level JSON array into a set of JSON values.
value ----------- 1 true [2,false]
|
Expands the top-level JSON array into a set of
value ----------- foo bar
|
Returns the number of elements in the top-level JSON array.
|
Expands the top-level JSON object into a set of key/value pairs.
key | value -----+------- a | "foo" b | "bar"
|
Expands the top-level JSON object into a set of key/value pairs.
The returned
key | value -----+------- a | foo b | bar
|
Extracts JSON sub-object at the specified path.
(This is functionally equivalent to the
|
Extracts JSON sub-object at the specified path as
|
Returns the set of keys in the top-level JSON object.
json_object_keys ------------------ f1 f2
|
Expands the top-level JSON object to a row having the composite type
of the To convert a JSON value to the SQL type of an output column, the following rules are applied in sequence:
While the example below uses a constant JSON value, typical use would
be to reference a
a | b | c ---+-----------+------------- 1 | {2,"a b"} | (4,"a b c")
|
Expands the top-level JSON array of objects to a set of rows having
the composite type of the
a | b ---+--- 1 | 2 3 | 4
|
Expands the top-level JSON object to a row having the composite type
defined by an
a | b | c | d | r ---+---------+---------+---+--------------- 1 | [1,2,3] | {1,2,3} | | (123,"a b c")
|
Expands the top-level JSON array of objects to a set of rows having
the composite type defined by an
a | b ---+----- 1 | foo 2 |
|
Returns
|
If
|
Returns
|
Deletes all object fields that have null values from the given JSON value, recursively. Null values that are not object fields are untouched.
|
Checks whether the JSON path returns any item for the specified JSON
value.
If the
|
Returns the result of a JSON path predicate check for the specified
JSON value. Only the first item of the result is taken into account.
If the result is not Boolean, then NULL is returned.
|
Returns all JSON items returned by the JSON path for the specified
JSON value.
The optional
jsonb_path_query ------------------ 2 3 4
|
Returns all JSON items returned by the JSON path for the specified
JSON value, as a JSON array.
The optional
|
Returns the first JSON item returned by the JSON path for the
specified JSON value. Returns NULL if there are no results.
|
These functions act like their counterparts described above without
the
|
Converts the given JSON value to pretty-printed, indented text.
[ { "f1": 1, "f2": null }, 2 ]
|
Returns the type of the top-level JSON value as a text string.
Possible types are object, array, string, number, boolean, and null.
|
SQL/JSON path expressions specify the items to be retrieved
from the JSON data, similar to XPath expressions used
for SQL access to XML. In PostgreSQL,
path expressions are implemented as the jsonpath
data type and can use any elements described in
Section 8.14.7.
JSON query functions and operators pass the provided path expression to the path engine for evaluation. If the expression matches the queried JSON data, the corresponding JSON item, or set of items, is returned. Path expressions are written in the SQL/JSON path language and can include arithmetic expressions and functions.
A path expression consists of a sequence of elements allowed
by the jsonpath
data type.
The path expression is normally evaluated from left to right, but
you can use parentheses to change the order of operations.
If the evaluation is successful, a sequence of JSON items is produced,
and the evaluation result is returned to the JSON query function
that completes the specified computation.
To refer to the JSON value being queried (the
context item), use the $
variable
in the path expression. It can be followed by one or more
accessor operators,
which go down the JSON structure level by level to retrieve sub-items
of the context item. Each operator that follows deals with the
result of the previous evaluation step.
For example, suppose you have some JSON data from a GPS tracker that you would like to parse, such as:
{ "track": { "segments": [ { "location": [ 47.763, 13.4034 ], "start time": "2018-10-14 10:05:14", "HR": 73 }, { "location": [ 47.706, 13.2635 ], "start time": "2018-10-14 10:39:21", "HR": 135 } ] } }
To retrieve the available track segments, you need to use the .key accessor operator to descend through surrounding JSON objects:
$.track.segments
To retrieve the contents of an array, you typically use the
[*]
operator. For example,
the following path will return the location coordinates for all
the available track segments:
$.track.segments[*].location
To return the coordinates of the first segment only, you can
specify the corresponding subscript in the []
accessor operator. Recall that JSON array indexes are 0-relative:
$.track.segments[0].location
The result of each path evaluation step can be processed
by one or more jsonpath
operators and methods
listed in Section 9.16.2.2.
Each method name must be preceded by a dot. For example,
you can get the size of an array:
$.track.segments.size()
More examples of using jsonpath
operators
and methods within path expressions appear below in
Section 9.16.2.2.
When defining a path, you can also use one or more
filter expressions that work similarly to the
WHERE
clause in SQL. A filter expression begins with
a question mark and provides a condition in parentheses:
? (condition)
Filter expressions must be written just after the path evaluation step
to which they should apply. The result of that step is filtered to include
only those items that satisfy the provided condition. SQL/JSON defines
three-valued logic, so the condition can be true
, false
,
or unknown
. The unknown
value
plays the same role as SQL NULL
and can be tested
for with the is unknown
predicate. Further path
evaluation steps use only those items for which the filter expression
returned true
.
The functions and operators that can be used in filter expressions are
listed in Table 9.49. Within a
filter expression, the @
variable denotes the value
being filtered (i.e., one result of the preceding path step). You can
write accessor operators after @
to retrieve component
items.
For example, suppose you would like to retrieve all heart rate values higher than 130. You can achieve this using the following expression:
$.track.segments[*].HR ? (@ > 130)
To get the start times of segments with such values, you have to filter out irrelevant segments before returning the start times, so the filter expression is applied to the previous step, and the path used in the condition is different:
$.track.segments[*] ? (@.HR > 130)."start time"
You can use several filter expressions in sequence, if required. For example, the following expression selects start times of all segments that contain locations with relevant coordinates and high heart rate values:
$.track.segments[*] ? (@.location[1] < 13.4) ? (@.HR > 130)."start time"
Using filter expressions at different nesting levels is also allowed. The following example first filters all segments by location, and then returns high heart rate values for these segments, if available:
$.track.segments[*] ? (@.location[1] < 13.4).HR ? (@ > 130)
You can also nest filter expressions within each other:
$.track ? (exists(@.segments[*] ? (@.HR > 130))).segments.size()
This expression returns the size of the track if it contains any segments with high heart rate values, or an empty sequence otherwise.
PostgreSQL's implementation of the SQL/JSON path language has the following deviations from the SQL/JSON standard:
A path expression can be a Boolean predicate, although the SQL/JSON
standard allows predicates only in filters. This is necessary for
implementation of the @@
operator. For example,
the following jsonpath
expression is valid in
PostgreSQL:
$.track.segments[*].HR < 70
There are minor differences in the interpretation of regular
expression patterns used in like_regex
filters, as
described in Section 9.16.2.3.
When you query JSON data, the path expression may not match the actual JSON data structure. An attempt to access a non-existent member of an object or element of an array results in a structural error. SQL/JSON path expressions have two modes of handling structural errors:
lax (default) — the path engine implicitly adapts the queried data to the specified path. Any remaining structural errors are suppressed and converted to empty SQL/JSON sequences.
strict — if a structural error occurs, an error is raised.
The lax mode facilitates matching of a JSON document structure and path expression if the JSON data does not conform to the expected schema. If an operand does not match the requirements of a particular operation, it can be automatically wrapped as an SQL/JSON array or unwrapped by converting its elements into an SQL/JSON sequence before performing this operation. Besides, comparison operators automatically unwrap their operands in the lax mode, so you can compare SQL/JSON arrays out-of-the-box. An array of size 1 is considered equal to its sole element. Automatic unwrapping is not performed only when:
The path expression contains type()
or
size()
methods that return the type
and the number of elements in the array, respectively.
The queried JSON data contain nested arrays. In this case, only the outermost array is unwrapped, while all the inner arrays remain unchanged. Thus, implicit unwrapping can only go one level down within each path evaluation step.
For example, when querying the GPS data listed above, you can abstract from the fact that it stores an array of segments when using the lax mode:
lax $.track.segments.location
In the strict mode, the specified path must exactly match the structure of
the queried JSON document to return an SQL/JSON item, so using this
path expression will cause an error. To get the same result as in
the lax mode, you have to explicitly unwrap the
segments
array:
strict $.track.segments[*].location
The .**
accessor can lead to surprising results
when using the lax mode. For instance, the following query selects every
HR
value twice:
lax $.**.HR
This happens because the .**
accessor selects both
the segments
array and each of its elements, while
the .HR
accessor automatically unwraps arrays when
using the lax mode. To avoid surprising results, we recommend using
the .**
accessor only in the strict mode. The
following query selects each HR
value just once:
strict $.**.HR
Table 9.48 shows the operators and
methods available in jsonpath
. Note that while the unary
operators and methods can be applied to multiple values resulting from a
preceding path step, the binary operators (addition etc.) can only be
applied to single values.
Table 9.48. jsonpath Operators and Methods
Operator/Method Description Example(s) |
---|
Addition
|
Unary plus (no operation); unlike addition, this can iterate over multiple values
|
Subtraction
|
Negation; unlike subtraction, this can iterate over multiple values
|
Multiplication
|
Division
|
Modulo (remainder)
|
Type of the JSON item (see
|
Size of the JSON item (number of array elements, or 1 if not an array)
|
Approximate floating-point number converted from a JSON number or string
|
Nearest integer greater than or equal to the given number
|
Nearest integer less than or equal to the given number
|
Absolute value of the given number
|
Date/time value converted from a string
|
Date/time value converted from a string using the
specified
|
The object's key-value pairs, represented as an array of objects containing three fields: "key", "value", and "id".
|
The result type of the datetime() and datetime(template) methods can be date, timetz, time, timestamptz, or timestamp. Both methods determine their result type dynamically.
The datetime()
method sequentially tries to
match its input string to the ISO formats
for date
, timetz
, time
,
timestamptz
, and timestamp
. It stops on
the first matching format and emits the corresponding data type.
The datetime(template) method determines the result type according to the fields used in the provided template string.
The datetime() and datetime(template) methods use the same parsing rules as the to_timestamp SQL function does (see Section 9.8), with three exceptions. First, these methods don't allow unmatched template patterns. Second, only the following separators are allowed in the template string: minus sign, period, solidus (slash), comma, apostrophe, semicolon, colon and space. Third, separators in the template string must exactly match the input string.
If different date/time types need to be compared, an implicit cast is
applied. A date
value can be cast to timestamp
or timestamptz
, timestamp
can be cast to
timestamptz
, and time
to timetz
.
However, all but the first of these conversions depend on the current
TimeZone setting, and thus can only be performed
within timezone-aware jsonpath
functions.
Table 9.49 shows the available filter expression elements.
Table 9.49. jsonpath Filter Expression Elements
Predicate/Value Description Example(s) |
---|
Equality comparison (this, and the other comparison operators, work on all JSON scalar values)
|
Non-equality comparison
|
Less-than comparison
|
Less-than-or-equal-to comparison
|
Greater-than comparison
|
Greater-than-or-equal-to comparison
|
JSON constant
|
JSON constant
|
JSON constant
|
Boolean AND
|
Boolean OR
|
Boolean NOT
|
Tests whether a Boolean condition is
|
Tests whether the first operand matches the regular expression
given by the second operand, optionally with modifications
described by a string of
|
Tests whether the second operand is an initial substring of the first operand.
|
Tests whether a path expression matches at least one SQL/JSON item.
Returns
|
SQL/JSON path expressions allow matching text to a regular expression
with the like_regex
filter. For example, the
following SQL/JSON path query would case-insensitively match all
strings in an array that start with an English vowel:
$[*] ? (@ like_regex "^[aeiou]" flag "i")
The optional flag
string may include one or more of
the characters
i
for case-insensitive match,
m
to allow ^
and $
to match at newlines,
s
to allow .
to match a newline,
and q
to quote the whole pattern (reducing the
behavior to a simple substring match).
The SQL/JSON standard borrows its definition for regular expressions
from the LIKE_REGEX
operator, which in turn uses the
XQuery standard. PostgreSQL does not currently support the
LIKE_REGEX
operator. Therefore,
the like_regex
filter is implemented using the
POSIX regular expression engine described in
Section 9.7.3. This leads to various minor
discrepancies from standard SQL/JSON behavior, which are cataloged in
Section 9.7.3.8.
Note, however, that the flag-letter incompatibilities described there
do not apply to SQL/JSON, as it translates the XQuery flag letters to
match what the POSIX engine expects.
Keep in mind that the pattern argument of like_regex
is a JSON path string literal, written according to the rules given in
Section 8.14.7. This means in particular that any
backslashes you want to use in the regular expression must be doubled.
For example, to match string values of the root document that contain
only digits:
$.* ? (@ like_regex "^\\d+$")
This section describes functions for operating on sequence objects, also called sequence generators or just sequences. Sequence objects are special single-row tables created with CREATE SEQUENCE. Sequence objects are commonly used to generate unique identifiers for rows of a table. The sequence functions, listed in Table 9.50, provide simple, multiuser-safe methods for obtaining successive sequence values from sequence objects.
Table 9.50. Sequence Functions
Function Description |
---|
Advances the sequence object to its next value and returns that value.
This is done atomically: even if multiple sessions
execute
This function requires |
Sets the sequence object's current value, and optionally its is_called flag. For example, after SELECT setval('myseq', 42); the next nextval will return 43.
The result returned by setval is just the value of its second argument.
This function requires |
Returns the value most recently obtained
by
This function requires |
Returns the value most recently returned by
This function requires |
To avoid blocking concurrent transactions that obtain numbers from
the same sequence, the value obtained by nextval
is not reclaimed for re-use if the calling transaction later aborts.
This means that transaction aborts or database crashes can result in
gaps in the sequence of assigned values. That can happen without a
transaction abort, too. For example an INSERT
with
an ON CONFLICT
clause will compute the to-be-inserted
tuple, including doing any required nextval
calls, before detecting any conflict that would cause it to follow
the ON CONFLICT
rule instead.
Thus, PostgreSQL sequence
objects cannot be used to obtain “gapless”
sequences.
Likewise, sequence state changes made by setval
are immediately visible to other transactions, and are not undone if
the calling transaction rolls back.
If the database cluster crashes before committing a transaction
containing a nextval
or setval
call, the sequence state change might
not have made its way to persistent storage, so that it is uncertain
whether the sequence will have its original or updated state after the
cluster restarts. This is harmless for usage of the sequence within
the database, since other effects of uncommitted transactions will not
be visible either. However, if you wish to use a sequence value for
persistent outside-the-database purposes, make sure that the
nextval
call has been committed before doing so.
The sequence to be operated on by a sequence function is specified by
a regclass
argument, which is simply the OID of the sequence in the
pg_class
system catalog. You do not have to look up the
OID by hand, however, since the regclass
data type's input
converter will do the work for you. See Section 8.19
for details.
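As a minimal sketch, using a hypothetical sequence named myseq (the plain string literal is cast to regclass automatically):
CREATE SEQUENCE myseq;
SELECT nextval('myseq');     -- returns 1
SELECT nextval('myseq');     -- returns 2
SELECT currval('myseq');     -- returns 2, the value last obtained in this session
SELECT setval('myseq', 42);  -- the next nextval will return 43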
This section describes the SQL-compliant conditional expressions available in PostgreSQL.
If your needs go beyond the capabilities of these conditional expressions, you might want to consider writing a server-side function in a more expressive programming language.
Although COALESCE
, GREATEST
, and
LEAST
are syntactically similar to functions, they are
not ordinary functions, and thus cannot be used with explicit
VARIADIC
array arguments.
CASE
The SQL CASE
expression is a
generic conditional expression, similar to if/else statements in
other programming languages:
CASE WHEN condition THEN result
     [WHEN ...]
     [ELSE result]
END
CASE
clauses can be used wherever
an expression is valid. Each condition
is an
expression that returns a boolean
result. If the condition's
result is true, the value of the CASE
expression is the
result
that follows the condition, and the
remainder of the CASE
expression is not processed. If the
condition's result is not true, any subsequent WHEN
clauses
are examined in the same manner. If no WHEN
condition
yields true, the value of the
CASE
expression is the result
of the
ELSE
clause. If the ELSE
clause is
omitted and no condition is true, the result is null.
An example:
SELECT * FROM test;

 a
---
 1
 2
 3

SELECT a,
       CASE WHEN a=1 THEN 'one'
            WHEN a=2 THEN 'two'
            ELSE 'other'
       END
    FROM test;

 a | case
---+-------
 1 | one
 2 | two
 3 | other
The data types of all the result
expressions must be convertible to a single output type.
See Section 10.5 for more details.
There is a “simple” form of CASE
expression
that is a variant of the general form above:
CASE expression
    WHEN value THEN result
    [WHEN ...]
    [ELSE result]
END
The first
expression
is computed, then compared to
each of the value
expressions in the
WHEN
clauses until one is found that is equal to it. If
no match is found, the result
of the
ELSE
clause (or a null value) is returned. This is similar
to the switch
statement in C.
The example above can be written using the simple
CASE
syntax:
SELECT a,
       CASE a WHEN 1 THEN 'one'
              WHEN 2 THEN 'two'
              ELSE 'other'
       END
    FROM test;

 a | case
---+-------
 1 | one
 2 | two
 3 | other
A CASE
expression does not evaluate any subexpressions
that are not needed to determine the result. For example, this is a
possible way of avoiding a division-by-zero failure:
SELECT ... WHERE CASE WHEN x <> 0 THEN y/x > 1.5 ELSE false END;
As described in Section 4.2.14, there are various
situations in which subexpressions of an expression are evaluated at
different times, so that the principle that “CASE
evaluates only necessary subexpressions” is not ironclad. For
example a constant 1/0
subexpression will usually result in
a division-by-zero failure at planning time, even if it's within
a CASE
arm that would never be entered at run time.
COALESCE
COALESCE(value [, ...])
The COALESCE
function returns the first of its
arguments that is not null. Null is returned only if all arguments
are null. It is often used to substitute a default value for
null values when data is retrieved for display, for example:
SELECT COALESCE(description, short_description, '(none)') ...
This returns description
if it is not null, otherwise
short_description
if it is not null, otherwise (none)
.
The arguments must all be convertible to a common data type, which will be the type of the result (see Section 10.5 for details).
Like a CASE
expression, COALESCE
only
evaluates the arguments that are needed to determine the result;
that is, arguments to the right of the first non-null argument are
not evaluated. This SQL-standard function provides capabilities similar
to NVL
and IFNULL
, which are used in some other
database systems.
NULLIF
NULLIF(value1, value2)
The NULLIF
function returns a null value if
value1
equals value2
;
otherwise it returns value1
.
This can be used to perform the inverse operation of the
COALESCE
example given above:
SELECT NULLIF(value, '(none)') ...
In this example, if value
is (none)
,
null is returned, otherwise the value of value
is returned.
The two arguments must be of comparable types.
To be specific, they are compared exactly as if you had written value1 = value2, so there must be a suitable = operator available.
The result has the same type as the first argument — but there is
a subtlety. What is actually returned is the first argument of the
implied =
operator, and in some cases that will have
been promoted to match the second argument's type. For
example, NULLIF(1, 2.2)
yields numeric
,
because there is no integer
=
numeric
operator,
only numeric
=
numeric
.
GREATEST and LEAST
GREATEST(value [, ...])
LEAST(value [, ...])
The GREATEST
and LEAST
functions select the
largest or smallest value from a list of any number of expressions.
The expressions must all be convertible to a common data type, which
will be the type of the result
(see Section 10.5 for details). NULL values
in the list are ignored. The result will be NULL only if all the
expressions evaluate to NULL.
Note that GREATEST
and LEAST
are not in
the SQL standard, but are a common extension. Some other databases
make them return NULL if any argument is NULL, rather than only when
all are NULL.
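A brief illustration of these rules, with results shown as comments:
SELECT GREATEST(1, 2, 3);     -- 3
SELECT LEAST(1, NULL, 2);     -- 1; the NULL is ignored
SELECT GREATEST(NULL, NULL);  -- NULL, because all arguments are NULL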
Table 9.51 shows the specialized operators available for array types. In addition to those, the usual comparison operators shown in Table 9.1 are available for arrays. The comparison operators compare the array contents element-by-element, using the default B-tree comparison function for the element data type, and sort based on the first difference. In multidimensional arrays the elements are visited in row-major order (last subscript varies most rapidly). If the contents of two arrays are equal but the dimensionality is different, the first difference in the dimensionality information determines the sort order.
Table 9.51. Array Operators
Operator Description Example(s) |
---|
Does the first array contain the second, that is, does each element
appearing in the second array equal some element of the first array?
(Duplicates are not treated specially,
thus
|
Is the first array contained by the second?
|
Do the arrays overlap, that is, have any elements in common?
|
Concatenates the two arrays. Concatenating a null or empty array is a no-op; otherwise the arrays must have the same number of dimensions (as illustrated by the first example) or differ in number of dimensions by one (as illustrated by the second). If the arrays are not of identical element types, they will be coerced to a common type (see Section 10.5).
|
Concatenates an element onto the front of an array (which must be empty or one-dimensional).
|
Concatenates an element onto the end of an array (which must be empty or one-dimensional).
|
See Section 8.15 for more details about array operator behavior. See Section 11.2 for more details about which operators support indexed operations.
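For example, the containment, overlap, and concatenation operators behave as sketched below (results shown as comments):
SELECT ARRAY[1,4,3] @> ARRAY[3,1,3];  -- true; duplicates are not treated specially
SELECT ARRAY[1,4,3] && ARRAY[2,1];    -- true; the arrays share the element 1
SELECT ARRAY[1,2] || ARRAY[3,4];      -- {1,2,3,4}
SELECT 3 || ARRAY[4,5,6];             -- {3,4,5,6}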
Table 9.52 shows the functions available for use with array types. See Section 8.15 for more information and examples of the use of these functions.
Table 9.52. Array Functions
Function Description Example(s) |
---|
Appends an element to the end of an array (same as
the
|
Concatenates two arrays (same as
the
|
Returns a text representation of the array's dimensions.
|
Returns an array filled with copies of the given value, having
dimensions of the lengths specified by the second argument.
The optional third argument supplies lower-bound values for each
dimension (which default to all
|
Returns the length of the requested array dimension. (Produces NULL instead of 0 for empty or missing array dimensions.)
|
Returns the lower bound of the requested array dimension.
|
Returns the number of dimensions of the array.
|
Returns the subscript of the first occurrence of the second argument
in the array, or
|
Returns an array of the subscripts of all occurrences of the second
argument in the array given as first argument.
The array must be one-dimensional.
Comparisons are done using
|
Prepends an element to the beginning of an array (same as
the
|
Removes all elements equal to the given value from the array.
The array must be one-dimensional.
Comparisons are done using
|
Replaces each array element equal to the second argument with the third argument.
|
Converts each array element to its text representation, and
concatenates those separated by
the
|
Returns the upper bound of the requested array dimension.
|
Returns the total number of elements in the array, or 0 if the array is empty.
|
Trims an array by removing the last
|
Expands an array into a set of rows. The array's elements are read out in storage order.
1 2
foo bar baz quux
|
Expands multiple arrays (possibly of different data types) into a set of
rows. If the arrays are not all the same length then the shorter ones
are padded with
a | b ---+----- 1 | foo 2 | bar | baz
|
There are two differences in the behavior of string_to_array
from pre-9.1 versions of PostgreSQL.
First, it will return an empty (zero-element) array rather
than NULL
when the input string is of zero length.
Second, if the delimiter string is NULL
, the function
splits the input into individual characters, rather than
returning NULL
as before.
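For example, with results shown as comments:
SELECT string_to_array('', ',');      -- {}, an empty array rather than NULL
SELECT string_to_array('abc', NULL);  -- {a,b,c}, split into individual characters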
See also Section 9.21 about the aggregate
function array_agg
for use with arrays.
See Section 8.17 for an overview of range types.
Table 9.53 shows the specialized operators available for range types. Table 9.54 shows the specialized operators available for multirange types. In addition to those, the usual comparison operators shown in Table 9.1 are available for range and multirange types. The comparison operators order first by the range lower bounds, and only if those are equal do they compare the upper bounds. The multirange operators compare each range until one is unequal. This does not usually result in a useful overall ordering, but the operators are provided to allow unique indexes to be constructed on ranges.
Table 9.53. Range Operators
Operator Description Example(s) |
---|
Does the first range contain the second?
|
Does the range contain the element?
|
Is the first range contained by the second?
|
Is the element contained in the range?
|
Do the ranges overlap, that is, have any elements in common?
|
Is the first range strictly left of the second?
|
Is the first range strictly right of the second?
|
Does the first range not extend to the right of the second?
|
Does the first range not extend to the left of the second?
|
Are the ranges adjacent?
|
Computes the union of the ranges. The ranges must overlap or be
adjacent, so that the union is a single range (but
see
|
Computes the intersection of the ranges.
|
Computes the difference of the ranges. The second range must not be contained in the first in such a way that the difference would not be a single range.
|
Table 9.54. Multirange Operators
Operator Description Example(s) |
---|
Does the first multirange contain the second?
|
Does the multirange contain the range?
|
Does the multirange contain the element?
|
Does the range contain the multirange?
|
Is the first multirange contained by the second?
|
Is the multirange contained by the range?
|
Is the range contained by the multirange?
|
Is the element contained by the multirange?
|
Do the multiranges overlap, that is, have any elements in common?
|
Does the multirange overlap the range?
|
Does the range overlap the multirange?
|
Is the first multirange strictly left of the second?
|
Is the multirange strictly left of the range?
|
Is the range strictly left of the multirange?
|
Is the first multirange strictly right of the second?
|
Is the multirange strictly right of the range?
|
Is the range strictly right of the multirange?
|
Does the first multirange not extend to the right of the second?
|
Does the multirange not extend to the right of the range?
|
Does the range not extend to the right of the multirange?
|
Does the first multirange not extend to the left of the second?
|
Does the multirange not extend to the left of the range?
|
Does the range not extend to the left of the multirange?
|
Are the multiranges adjacent?
|
Is the multirange adjacent to the range?
|
Is the range adjacent to the multirange?
|
Computes the union of the multiranges. The multiranges need not overlap or be adjacent.
|
Computes the intersection of the multiranges.
|
Computes the difference of the multiranges.
|
The left-of/right-of/adjacent operators always return false when an empty range or multirange is involved; that is, an empty range is not considered to be either before or after any other range.
Elsewhere empty ranges and multiranges are treated as the additive identity: anything unioned with an empty value is itself. Anything minus an empty value is itself. An empty multirange has exactly the same points as an empty range. Every range contains the empty range. Every multirange contains as many empty ranges as you like.
The range union and difference operators will fail if the resulting range would need to contain two disjoint sub-ranges, as such a range cannot be represented. There are separate operators for union and difference that take multirange parameters and return a multirange, and they do not fail even if their arguments are disjoint. So if you need a union or difference operation for ranges that may be disjoint, you can avoid errors by first casting your ranges to multiranges.
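A small sketch using int4range and int4multirange; the second statement relies on the cast from a range type to its multirange type:
SELECT int4range(1, 3) + int4range(5, 8);
-- ERROR: the result would not be a single range
SELECT int4range(1, 3)::int4multirange + int4range(5, 8)::int4multirange;
-- {[1,3),[5,8)}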
Table 9.55 shows the functions available for use with range types. Table 9.56 shows the functions available for use with multirange types.
Table 9.55. Range Functions
Table 9.56. Multirange Functions
The lower_inc
, upper_inc
,
lower_inf
, and upper_inf
functions all return false for an empty range or multirange.
Aggregate functions compute a single result from a set of input values. The built-in general-purpose aggregate functions are listed in Table 9.57 while statistical aggregates are in Table 9.58. The built-in within-group ordered-set aggregate functions are listed in Table 9.59 while the built-in within-group hypothetical-set ones are in Table 9.60. Grouping operations, which are closely related to aggregate functions, are listed in Table 9.61. The special syntax considerations for aggregate functions are explained in Section 4.2.7. Consult Section 2.7 for additional introductory information.
Aggregate functions that support Partial Mode are eligible to participate in various optimizations, such as parallel aggregation.
Table 9.57. General-Purpose Aggregate Functions
Function Description | Partial Mode |
---|---|
Collects all the input values, including nulls, into an array. | No |
Concatenates all the input arrays into an array of one higher dimension. (The inputs must all have the same dimensionality, and cannot be empty or null.) | No |
Computes the average (arithmetic mean) of all the non-null input values. | Yes |
Computes the bitwise AND of all non-null input values. | Yes |
Computes the bitwise OR of all non-null input values. | Yes |
Computes the bitwise exclusive OR of all non-null input values. Can be useful as a checksum for an unordered set of values. | Yes |
Returns true if all non-null input values are true, otherwise false. | Yes |
Returns true if any non-null input value is true, otherwise false. | Yes |
Computes the number of input rows. | Yes |
Computes the number of input rows in which the input value is not null. | Yes |
This is the SQL standard's equivalent to bool_and. | Yes |
Collects all the input values, including nulls, into a JSON array.
Values are converted to JSON as per | No |
Collects all the key/value pairs into a JSON object. Key arguments
are coerced to text; value arguments are converted as
per | No |
Computes the maximum of the non-null input
values. Available for any numeric, string, date/time, or enum type,
as well as | Yes |
Computes the minimum of the non-null input
values. Available for any numeric, string, date/time, or enum type,
as well as | Yes |
Computes the union of the non-null input values. | No |
Computes the intersection of the non-null input values. | No |
Concatenates the non-null input values into a string. Each value
after the first is preceded by the
corresponding | No |
Computes the sum of the non-null input values. | Yes |
Concatenates the non-null XML input values (see Section 9.15.1.7). | No |
It should be noted that except for count
,
these functions return a null value when no rows are selected. In
particular, sum
of no rows returns null, not
zero as one might expect, and array_agg
returns null rather than an empty array when there are no input
rows. The coalesce
function can be used to
substitute zero or an empty array for null when necessary.
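For example, the following sketch aggregates over zero rows, so sum returns null and coalesce substitutes zero:
SELECT coalesce(sum(n), 0) AS total FROM generate_series(1, 0) AS g(n);  -- total is 0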
The aggregate functions array_agg
,
json_agg
, jsonb_agg
,
json_object_agg
, jsonb_object_agg
,
string_agg
,
and xmlagg
, as well as similar user-defined
aggregate functions, produce meaningfully different result values
depending on the order of the input values. This ordering is
unspecified by default, but can be controlled by writing an
ORDER BY
clause within the aggregate call, as shown in
Section 4.2.7.
Alternatively, supplying the input values from a sorted subquery
will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
Beware that this approach can fail if the outer query level contains additional processing, such as a join, because that might cause the subquery's output to be reordered before the aggregate is computed.
The boolean aggregates bool_and
and
bool_or
correspond to the standard SQL aggregates
every
and any
or
some
.
PostgreSQL
supports every
, but not any
or some
, because there is an ambiguity built into
the standard syntax:
SELECT b1 = ANY((SELECT b2 FROM t2 ...)) FROM t1 ...;
Here ANY
can be considered either as introducing
a subquery, or as being an aggregate function, if the subquery
returns one row with a Boolean value.
Thus the standard name cannot be given to these aggregates.
Users accustomed to working with other SQL database management
systems might be disappointed by the performance of the
count
aggregate when it is applied to the
entire table. A query like:
SELECT count(*) FROM sometable;
will require effort proportional to the size of the table: PostgreSQL will need to scan either the entire table or the entirety of an index that includes all rows in the table.
Table 9.58 shows
aggregate functions typically used in statistical analysis.
(These are separated out merely to avoid cluttering the listing
of more-commonly-used aggregates.) Functions shown as
accepting numeric_type
are available for all
the types smallint
, integer
,
bigint
, numeric
, real
,
and double precision
.
Where the description mentions
N
, it means the
number of input rows for which all the input expressions are non-null.
In all cases, null is returned if the computation is meaningless,
for example when N
is zero.
Table 9.58. Aggregate Functions for Statistics
Table 9.59 shows some
aggregate functions that use the ordered-set aggregate
syntax. These functions are sometimes referred to as “inverse
distribution” functions. Their aggregated input is introduced by
ORDER BY
, and they may also take a direct
argument that is not aggregated, but is computed only once.
All these functions ignore null values in their aggregated input.
For those that take a fraction
parameter, the
fraction value must be between 0 and 1; an error is thrown if not.
However, a null fraction
value simply produces a
null result.
Table 9.59. Ordered-Set Aggregate Functions
Each of the “hypothetical-set” aggregates listed in
Table 9.60 is associated with a
window function of the same name defined in
Section 9.22. In each case, the aggregate's result
is the value that the associated window function would have
returned for the “hypothetical” row constructed from
args
, if such a row had been added to the sorted
group of rows represented by the sorted_args
.
For each of these functions, the list of direct arguments
given in args
must match the number and types of
the aggregated arguments given in sorted_args
.
Unlike most built-in aggregates, these aggregates are not strict, that is
they do not drop input rows containing nulls. Null values sort according
to the rule specified in the ORDER BY
clause.
Table 9.60. Hypothetical-Set Aggregate Functions
Table 9.61. Grouping Operations
The grouping operations shown in
Table 9.61 are used in conjunction with
grouping sets (see Section 7.2.4) to distinguish
result rows. The arguments to the GROUPING
function
are not actually evaluated, but they must exactly match expressions given
in the GROUP BY
clause of the associated query level.
For example:
=> SELECT * FROM items_sold;
 make  | model | sales
-------+-------+-------
 Foo   | GT    |    10
 Foo   | Tour  |    20
 Bar   | City  |    15
 Bar   | Sport |     5
(4 rows)

=> SELECT make, model, GROUPING(make,model), sum(sales) FROM items_sold GROUP BY ROLLUP(make,model);
 make  | model | grouping | sum
-------+-------+----------+-----
 Foo   | GT    |        0 |  10
 Foo   | Tour  |        0 |  20
 Bar   | City  |        0 |  15
 Bar   | Sport |        0 |   5
 Foo   |       |        1 |  30
 Bar   |       |        1 |  20
       |       |        3 |  50
(7 rows)
Here, the grouping
value 0
in the
first four rows shows that those have been grouped normally, over both the
grouping columns. The value 1
indicates
that model
was not grouped by in the next-to-last two
rows, and the value 3
indicates that
neither make
nor model
was grouped
by in the last row (which therefore is an aggregate over all the input
rows).
Window functions provide the ability to perform calculations across sets of rows that are related to the current query row. See Section 3.5 for an introduction to this feature, and Section 4.2.8 for syntax details.
The built-in window functions are listed in
Table 9.62. Note that these functions
must be invoked using window function syntax, i.e., an
OVER
clause is required.
In addition to these functions, any built-in or user-defined
ordinary aggregate (i.e., not ordered-set or hypothetical-set aggregates)
can be used as a window function; see
Section 9.21 for a list of the built-in aggregates.
Aggregate functions act as window functions only when an OVER
clause follows the call; otherwise they act as plain aggregates
and return a single row for the entire set.
Table 9.62. General-Purpose Window Functions
All of the functions listed in
Table 9.62 depend on the sort ordering
specified by the ORDER BY
clause of the associated window
definition. Rows that are not distinct when considering only the
ORDER BY
columns are said to be peers.
The four ranking functions (including cume_dist
) are
defined so that they give the same answer for all rows of a peer group.
Note that first_value
, last_value
, and
nth_value
consider only the rows within the “window
frame”, which by default contains the rows from the start of the
partition through the last peer of the current row. This is
likely to give unhelpful results for last_value
and
sometimes also nth_value
. You can redefine the frame by
adding a suitable frame specification (RANGE
,
ROWS
or GROUPS
) to
the OVER
clause.
See Section 4.2.8 for more information
about frame specifications.
When an aggregate function is used as a window function, it aggregates
over the rows within the current row's window frame.
An aggregate used with ORDER BY
and the default window frame
definition produces a “running sum” type of behavior, which may or
may not be what's wanted. To obtain
aggregation over the whole partition, omit ORDER BY
or use
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
.
Other frame specifications can be used to obtain other effects.
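As an illustrative sketch, the following query contrasts the default running-sum behavior with aggregation over the whole partition:
SELECT x,
       sum(x) OVER (ORDER BY x) AS running_sum,
       sum(x) OVER ()           AS total
FROM generate_series(1, 4) AS t(x);
 x | running_sum | total
---+-------------+-------
 1 |           1 |    10
 2 |           3 |    10
 3 |           6 |    10
 4 |          10 |    10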
The SQL standard defines a RESPECT NULLS
or
IGNORE NULLS
option for lead
, lag
,
first_value
, last_value
, and
nth_value
. This is not implemented in
PostgreSQL: the behavior is always the
same as the standard's default, namely RESPECT NULLS
.
Likewise, the standard's FROM FIRST
or FROM LAST
option for nth_value
is not implemented: only the
default FROM FIRST
behavior is supported. (You can achieve
the result of FROM LAST
by reversing the ORDER BY
ordering.)
This section describes the SQL-compliant subquery expressions available in PostgreSQL. All of the expression forms documented in this section return Boolean (true/false) results.
EXISTS
EXISTS (subquery)
The argument of EXISTS
is an arbitrary SELECT
statement,
or subquery. The
subquery is evaluated to determine whether it returns any rows.
If it returns at least one row, the result of EXISTS
is
“true”; if the subquery returns no rows, the result of EXISTS
is “false”.
The subquery can refer to variables from the surrounding query, which will act as constants during any one evaluation of the subquery.
The subquery will generally only be executed long enough to determine whether at least one row is returned, not all the way to completion. It is unwise to write a subquery that has side effects (such as calling sequence functions); whether the side effects occur might be unpredictable.
Since the result depends only on whether any rows are returned,
and not on the contents of those rows, the output list of the
subquery is normally unimportant. A common coding convention is
to write all EXISTS
tests in the form
EXISTS(SELECT 1 WHERE ...)
. There are exceptions to
this rule however, such as subqueries that use INTERSECT
.
This simple example is like an inner join on col2
, but
it produces at most one output row for each tab1
row,
even if there are several matching tab2
rows:
SELECT col1 FROM tab1 WHERE EXISTS (SELECT 1 FROM tab2 WHERE col2 = tab1.col2);
IN
expression IN (subquery)
The right-hand side is a parenthesized
subquery, which must return exactly one column. The left-hand expression
is evaluated and compared to each row of the subquery result.
The result of IN
is “true” if any equal subquery row is found.
The result is “false” if no equal row is found (including the
case where the subquery returns no rows).
Note that if the left-hand expression yields null, or if there are
no equal right-hand values and at least one right-hand row yields
null, the result of the IN
construct will be null, not false.
This is in accordance with SQL's normal rules for Boolean combinations
of null values.
As with EXISTS
, it's unwise to assume that the subquery will
be evaluated completely.
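Reusing the hypothetical tab1 and tab2 tables from the EXISTS example above, a typical IN query looks like:
SELECT col1 FROM tab1 WHERE col2 IN (SELECT col2 FROM tab2);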
row_constructor IN (subquery)
The left-hand side of this form of IN
is a row constructor,
as described in Section 4.2.13.
The right-hand side is a parenthesized
subquery, which must return exactly as many columns as there are
expressions in the left-hand row. The left-hand expressions are
evaluated and compared row-wise to each row of the subquery result.
The result of IN
is “true” if any equal subquery row is found.
The result is “false” if no equal row is found (including the
case where the subquery returns no rows).
As usual, null values in the rows are combined per
the normal rules of SQL Boolean expressions. Two rows are considered
equal if all their corresponding members are non-null and equal; the rows
are unequal if any corresponding members are non-null and unequal;
otherwise the result of that row comparison is unknown (null).
If all the per-row results are either unequal or null, with at least one
null, then the result of IN
is null.
NOT IN
expression NOT IN (subquery)
The right-hand side is a parenthesized
subquery, which must return exactly one column. The left-hand expression
is evaluated and compared to each row of the subquery result.
The result of NOT IN
is “true” if only unequal subquery rows
are found (including the case where the subquery returns no rows).
The result is “false” if any equal row is found.
Note that if the left-hand expression yields null, or if there are
no equal right-hand values and at least one right-hand row yields
null, the result of the NOT IN
construct will be null, not true.
This is in accordance with SQL's normal rules for Boolean combinations
of null values.
As with EXISTS
, it's unwise to assume that the subquery will
be evaluated completely.
row_constructor NOT IN (subquery)
The left-hand side of this form of NOT IN
is a row constructor,
as described in Section 4.2.13.
The right-hand side is a parenthesized
subquery, which must return exactly as many columns as there are
expressions in the left-hand row. The left-hand expressions are
evaluated and compared row-wise to each row of the subquery result.
The result of NOT IN
is “true” if only unequal subquery rows
are found (including the case where the subquery returns no rows).
The result is “false” if any equal row is found.
As usual, null values in the rows are combined per
the normal rules of SQL Boolean expressions. Two rows are considered
equal if all their corresponding members are non-null and equal; the rows
are unequal if any corresponding members are non-null and unequal;
otherwise the result of that row comparison is unknown (null).
If all the per-row results are either unequal or null, with at least one
null, then the result of NOT IN
is null.
ANY/SOME
expression operator ANY (subquery)
expression operator SOME (subquery)
The right-hand side is a parenthesized
subquery, which must return exactly one column. The left-hand expression
is evaluated and compared to each row of the subquery result using the
given operator
, which must yield a Boolean
result.
The result of ANY
is “true” if any true result is obtained.
The result is “false” if no true result is found (including the
case where the subquery returns no rows).
SOME
is a synonym for ANY
.
IN
is equivalent to = ANY
.
Note that if there are no successes and at least one right-hand row yields
null for the operator's result, the result of the ANY
construct
will be null, not false.
This is in accordance with SQL's normal rules for Boolean combinations
of null values.
As with EXISTS
, it's unwise to assume that the subquery will
be evaluated completely.
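A minimal sketch, where the subquery returns the integers 1 through 5:
SELECT 3 < ANY (SELECT * FROM generate_series(1, 5));  -- true, since for example 3 < 4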
row_constructor operator ANY (subquery)
row_constructor operator SOME (subquery)
The left-hand side of this form of ANY
is a row constructor,
as described in Section 4.2.13.
The right-hand side is a parenthesized
subquery, which must return exactly as many columns as there are
expressions in the left-hand row. The left-hand expressions are
evaluated and compared row-wise to each row of the subquery result,
using the given operator
.
The result of ANY
is “true” if the comparison
returns true for any subquery row.
The result is “false” if the comparison returns false for every
subquery row (including the case where the subquery returns no
rows).
The result is NULL if no comparison with a subquery row returns true,
and at least one comparison returns NULL.
See Section 9.24.5 for details about the meaning of a row constructor comparison.
ALL
expression operator ALL (subquery)
The right-hand side is a parenthesized
subquery, which must return exactly one column. The left-hand expression
is evaluated and compared to each row of the subquery result using the
given operator
, which must yield a Boolean
result.
The result of ALL
is “true” if all rows yield true
(including the case where the subquery returns no rows).
The result is “false” if any false result is found.
The result is NULL if no comparison with a subquery row returns false,
and at least one comparison returns NULL.
NOT IN
is equivalent to <> ALL
.
As with EXISTS
, it's unwise to assume that the subquery will
be evaluated completely.
row_constructor operator ALL (subquery)
The left-hand side of this form of ALL
is a row constructor,
as described in Section 4.2.13.
The right-hand side is a parenthesized
subquery, which must return exactly as many columns as there are
expressions in the left-hand row. The left-hand expressions are
evaluated and compared row-wise to each row of the subquery result,
using the given operator
.
The result of ALL
is “true” if the comparison
returns true for all subquery rows (including the
case where the subquery returns no rows).
The result is “false” if the comparison returns false for any
subquery row.
The result is NULL if no comparison with a subquery row returns false,
and at least one comparison returns NULL.
See Section 9.24.5 for details about the meaning of a row constructor comparison.
row_constructor operator (subquery)
The left-hand side is a row constructor, as described in Section 4.2.13. The right-hand side is a parenthesized subquery, which must return exactly as many columns as there are expressions in the left-hand row. Furthermore, the subquery cannot return more than one row. (If it returns zero rows, the result is taken to be null.) The left-hand side is evaluated and compared row-wise to the single subquery result row.
See Section 9.24.5 for details about the meaning of a row constructor comparison.
This section describes several specialized constructs for making multiple comparisons between groups of values. These forms are syntactically related to the subquery forms of the previous section, but do not involve subqueries. The forms involving array subexpressions are PostgreSQL extensions; the rest are SQL-compliant. All of the expression forms documented in this section return Boolean (true/false) results.
IN
expression IN (value [, ...])
The right-hand side is a parenthesized list of expressions. The result is “true” if the left-hand expression's result is equal to any of the right-hand expressions. This is a shorthand notation for
expression = value1
OR expression = value2
OR ...
Note that if the left-hand expression yields null, or if there are
no equal right-hand values and at least one right-hand expression yields
null, the result of the IN
construct will be null, not false.
This is in accordance with SQL's normal rules for Boolean combinations
of null values.
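For example, with results shown as comments:
SELECT 2 IN (1, 2, 3);  -- true
SELECT 1 IN (2, NULL);  -- null, not false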
NOT IN
expression NOT IN (value [, ...])
The right-hand side is a parenthesized list of expressions. The result is “true” if the left-hand expression's result is unequal to all of the right-hand expressions. This is a shorthand notation for
expression <> value1
AND expression <> value2
AND ...
Note that if the left-hand expression yields null, or if there are
no equal right-hand values and at least one right-hand expression yields
null, the result of the NOT IN
construct will be null, not true
as one might naively expect.
This is in accordance with SQL's normal rules for Boolean combinations
of null values.
x NOT IN y
is equivalent to NOT (x IN y)
in all
cases. However, null values are much more likely to trip up the novice when
working with NOT IN
than when working with IN
.
It is best to express your condition positively if possible.
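The following sketch shows the null behavior described above:
SELECT 1 NOT IN (2, 3);     -- true
SELECT 1 NOT IN (2, NULL);  -- null, not true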
ANY/SOME (array)
expression operator ANY (array expression)
expression operator SOME (array expression)
The right-hand side is a parenthesized expression, which must yield an
array value.
The left-hand expression
is evaluated and compared to each element of the array using the
given operator
, which must yield a Boolean
result.
The result of ANY
is “true” if any true result is obtained.
The result is “false” if no true result is found (including the
case where the array has zero elements).
If the array expression yields a null array, the result of
ANY
will be null. If the left-hand expression yields null,
the result of ANY
is ordinarily null (though a non-strict
comparison operator could possibly yield a different result).
Also, if the right-hand array contains any null elements and no true
comparison result is obtained, the result of ANY
will be null, not false (again, assuming a strict comparison operator).
This is in accordance with SQL's normal rules for Boolean combinations
of null values.
SOME
is a synonym for ANY
.
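For example, with results shown as comments:
SELECT 3 = ANY (ARRAY[1, 2, 3]);  -- true
SELECT 5 = ANY (ARRAY[1, NULL]);  -- null: no match was found and a null element is present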
ALL (array)
expression operator ALL (array expression)
The right-hand side is a parenthesized expression, which must yield an
array value.
The left-hand expression
is evaluated and compared to each element of the array using the
given operator
, which must yield a Boolean
result.
The result of ALL
is “true” if all comparisons yield true
(including the case where the array has zero elements).
The result is “false” if any false result is found.
If the array expression yields a null array, the result of
ALL
will be null. If the left-hand expression yields null,
the result of ALL
is ordinarily null (though a non-strict
comparison operator could possibly yield a different result).
Also, if the right-hand array contains any null elements and no false
comparison result is obtained, the result of ALL
will be null, not true (again, assuming a strict comparison operator).
This is in accordance with SQL's normal rules for Boolean combinations
of null values.
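For example, with results shown as comments:
SELECT 3 > ALL (ARRAY[1, 2]);     -- true
SELECT 3 > ALL (ARRAY[]::int[]);  -- true: the array has zero elements
SELECT 3 > ALL (ARRAY[1, NULL]);  -- null: no false result, but a null element is present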
row_constructor operator row_constructor
Each side is a row constructor,
as described in Section 4.2.13.
The two row constructors must have the same number of fields.
The given operator
is applied to each pair
of corresponding fields. (Since the fields could be of different
types, this means that a different specific operator could be selected
for each pair.)
All the selected operators must be members of some B-tree operator
class, or be the negator of an =
member of a B-tree
operator class, meaning that row constructor comparison is only
possible when the operator
is
=
,
<>
,
<
,
<=
,
>
, or
>=
,
or has semantics similar to one of these.
The =
and <>
cases work slightly differently
from the others. Two rows are considered
equal if all their corresponding members are non-null and equal; the rows
are unequal if any corresponding members are non-null and unequal;
otherwise the result of the row comparison is unknown (null).
For the <
, <=
, >
and
>=
cases, the row elements are compared left-to-right,
stopping as soon as an unequal or null pair of elements is found.
If either of this pair of elements is null, the result of the
row comparison is unknown (null); otherwise comparison of this pair
of elements determines the result. For example,
ROW(1,2,NULL) < ROW(1,3,0)
yields true, not null, because the third pair of elements are not
considered.
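For example, with results shown as comments:
SELECT ROW(1, 2, NULL) < ROW(1, 3, 0);  -- true, decided at the second pair of elements
SELECT ROW(1, 2, NULL) = ROW(1, 2, 0);  -- null, because the third pair involves a null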
Prior to PostgreSQL 8.2, the
<
, <=
, >
and >=
cases were not handled per SQL specification. A comparison like
ROW(a,b) < ROW(c,d)
was implemented as
a < c AND b < d
whereas the correct behavior is equivalent to
a < c OR (a = c AND b < d)
.
row_constructor IS DISTINCT FROM row_constructor
This construct is similar to a <>
row comparison,
but it does not yield null for null inputs. Instead, any null value is
considered unequal to (distinct from) any non-null value, and any two
nulls are considered equal (not distinct). Thus the result will
either be true or false, never null.
row_constructor IS NOT DISTINCT FROM row_constructor
This construct is similar to a =
row comparison,
but it does not yield null for null inputs. Instead, any null value is
considered unequal to (distinct from) any non-null value, and any two
nulls are considered equal (not distinct). Thus the result will always
be either true or false, never null.
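For example, with results shown as comments:
SELECT ROW(1, NULL) IS DISTINCT FROM ROW(1, NULL);      -- false: the nulls are not distinct
SELECT ROW(1, NULL) IS NOT DISTINCT FROM ROW(1, NULL);  -- true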
record operator record
The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.24.5) or comparing a row constructor to the output of a subquery (as in Section 9.23). In other contexts where two composite-type values are compared, two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This is necessary in order to have consistent sorting and indexing behavior for composite types.
Each side is evaluated and they are compared row-wise. Composite type
comparisons are allowed when the operator
is
=
,
<>
,
<
,
<=
,
>
or
>=
,
or has semantics similar to one of these. (To be specific, an operator
can be a row comparison operator if it is a member of a B-tree operator
class, or is the negator of the =
member of a B-tree operator
class.) The default behavior of the above operators is the same as for
IS [ NOT ] DISTINCT FROM
for row constructors (see
Section 9.24.5).
To support matching of rows which include elements without a default B-tree operator class, the following operators are defined for composite type comparison: *=, *<>, *<, *<=, *>, and *>=. These operators compare the internal binary representation of the two rows. Two rows might have a different binary representation even though comparison of the two rows with the equality operator is true. The ordering of rows under these comparison operators is deterministic but not otherwise meaningful. These operators are used internally for materialized views and might be useful for other specialized purposes such as replication and B-Tree deduplication (see Section 64.4.3). They are not intended to be generally useful for writing queries, though.
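A hedged sketch of the difference between ordinary equality and the binary-image operators; the second result depends on the internal representation (here the numeric values 1.0 and 1.00 compare equal but are stored with different display scales):
SELECT ROW(1.0::numeric) = ROW(1.00::numeric);    -- true: the values compare equal
SELECT ROW(1.0::numeric) *= ROW(1.00::numeric);   -- expected to be false: the stored binary images differ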
This section describes functions that possibly return more than one row. The most widely used functions in this class are series generating functions, as detailed in Table 9.63 and Table 9.64. Other, more specialized set-returning functions are described elsewhere in this manual. See Section 7.2.1.4 for ways to combine multiple set-returning functions.
Table 9.63. Series Generating Functions
When step is positive, zero rows are returned if start is greater than stop. Conversely, when step is negative, zero rows are returned if start is less than stop. Zero rows are also returned if any input is NULL. It is an error for step to be zero. Some examples follow:
SELECT * FROM generate_series(2,4);
 generate_series
-----------------
               2
               3
               4
(3 rows)

SELECT * FROM generate_series(5,1,-2);
 generate_series
-----------------
               5
               3
               1
(3 rows)

SELECT * FROM generate_series(4,3);
 generate_series
-----------------
(0 rows)

SELECT generate_series(1.1, 4, 1.3);
 generate_series
-----------------
             1.1
             2.4
             3.7
(3 rows)

-- this example relies on the date-plus-integer operator:
SELECT current_date + s.a AS dates FROM generate_series(0,14,7) AS s(a);
   dates
------------
 2004-02-05
 2004-02-12
 2004-02-19
(3 rows)

SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
                              '2008-03-04 12:00', '10 hours');
   generate_series
---------------------
 2008-03-01 00:00:00
 2008-03-01 10:00:00
 2008-03-01 20:00:00
 2008-03-02 06:00:00
 2008-03-02 16:00:00
 2008-03-03 02:00:00
 2008-03-03 12:00:00
 2008-03-03 22:00:00
 2008-03-04 08:00:00
(9 rows)
Table 9.64. Subscript Generating Functions
generate_subscripts is a convenience function that generates the set of valid subscripts for the specified dimension of the given array. Zero rows are returned for arrays that do not have the requested dimension, or if any input is NULL. Some examples follow:
-- basic usage:
SELECT generate_subscripts('{NULL,1,NULL,2}'::int[], 1) AS s;
 s
---
 1
 2
 3
 4
(4 rows)

-- presenting an array, the subscript and the subscripted
-- value requires a subquery:
SELECT * FROM arrays;
         a
--------------------
 {-1,-2}
 {100,200,300}
(2 rows)

SELECT a AS array, s AS subscript, a[s] AS value
FROM (SELECT generate_subscripts(a, 1) AS s, a FROM arrays) foo;
     array     | subscript | value
---------------+-----------+-------
 {-1,-2}       |         1 |    -1
 {-1,-2}       |         2 |    -2
 {100,200,300} |         1 |   100
 {100,200,300} |         2 |   200
 {100,200,300} |         3 |   300
(5 rows)

-- unnest a 2D array:
CREATE OR REPLACE FUNCTION unnest2(anyarray)
RETURNS SETOF anyelement AS $$
select $1[i][j]
   from generate_subscripts($1,1) g1(i),
        generate_subscripts($1,2) g2(j);
$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION
SELECT * FROM unnest2(ARRAY[[1,2],[3,4]]);
 unnest2
---------
       1
       2
       3
       4
(4 rows)
When a function in the FROM clause is suffixed by WITH ORDINALITY, a bigint column is appended to the function's output column(s), which starts from 1 and increments by 1 for each row of the function's output. This is most useful in the case of set-returning functions such as unnest().
-- set returning function WITH ORDINALITY:
SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
       ls        | n
-----------------+----
 pg_serial       |  1
 pg_twophase     |  2
 postmaster.opts |  3
 pg_notify       |  4
 postgresql.conf |  5
 pg_tblspc       |  6
 logfile         |  7
 base            |  8
 postmaster.pid  |  9
 pg_ident.conf   | 10
 global          | 11
 pg_xact         | 12
 pg_snapshots    | 13
 pg_multixact    | 14
 PG_VERSION      | 15
 pg_wal          | 16
 pg_hba.conf     | 17
 pg_stat_tmp     | 18
 pg_subtrans     | 19
(19 rows)
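A simpler sketch of the same clause applied to unnest() (the array contents are arbitrary):
SELECT * FROM unnest(ARRAY['a','b','c']) WITH ORDINALITY AS t(elem, n);
 elem | n
------+---
 a    | 1
 b    | 2
 c    | 3
(3 rows)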
Table 9.65 shows several functions that extract session and system information.
In addition to the functions listed in this section, there are a number of functions related to the statistics system that also provide system information. See Section 28.2.22 for more information.
Table 9.65. Session Information Functions
Function Description |
---|
Returns the name of the current database. (Databases are
called “catalogs” in the SQL standard,
so |
Returns the text of the currently executing query, as submitted by the client (which might contain more than one statement). |
This is equivalent to |
Returns the name of the schema that is first in the search path (or a null value if the search path is empty). This is the schema that will be used for any tables or other named objects that are created without specifying a target schema. |
Returns an array of the names of all schemas presently in the
effective search path, in their priority order. (Items in the current
search_path setting that do not correspond to
existing, searchable schemas are omitted.) If the Boolean argument
is |
Returns the user name of the current execution context. |
Returns the IP address of the current client,
or |
Returns the IP port number of the current client,
or |
Returns the IP address on which the server accepted the current
connection,
or |
Returns the IP port number on which the server accepted the current
connection,
or |
Returns the process ID of the server process attached to the current session. |
Returns an array of the process ID(s) of the sessions that are blocking the server process with the specified process ID from acquiring a lock, or an empty array if there is no such server process or it is not blocked.
One server process blocks another if it either holds a lock that
conflicts with the blocked process's lock request (hard block), or is
waiting for a lock that would conflict with the blocked process's lock
request and is ahead of it in the wait queue (soft block). When using
parallel queries the result always lists client-visible process IDs
(that is, Frequent calls to this function could have some impact on database performance, because it needs exclusive access to the lock manager's shared state for a short time. |
Returns the time when the server configuration files were last loaded. If the current session was alive at the time, this will be the time when the session itself re-read the configuration files (so the reading will vary a little in different sessions). Otherwise it is the time when the postmaster process re-read the configuration files. |
Returns the path name of the log file currently in use by the logging
collector. The path includes the log_directory
directory and the individual log file name. The result
is |
Returns the OID of the current session's temporary schema, or zero if it has none (because it has not created any temporary tables). |
Returns true if the given OID is the OID of another session's temporary schema. (This can be useful, for example, to exclude other sessions' temporary tables from a catalog display.) |
Returns true if a JIT compiler extension is
available (see Chapter 32) and the
jit configuration parameter is set to
|
Returns the set of names of asynchronous notification channels that the current session is listening to. |
Returns the fraction (0–1) of the asynchronous notification queue's maximum size that is currently occupied by notifications that are waiting to be processed. See LISTEN and NOTIFY for more information. |
Returns the time when the server started. |
Returns an array of the process ID(s) of the sessions that are blocking the server process with the specified process ID from acquiring a safe snapshot, or an empty array if there is no such server process or it is not blocked.
A session running a Frequent calls to this function could have some impact on database performance, because it needs access to the predicate lock manager's shared state for a short time. |
Returns the current nesting level of PostgreSQL triggers (0 if not called, directly or indirectly, from inside a trigger). |
Returns the session user's name. |
This is equivalent to |
Returns a string describing the PostgreSQL
server's version. You can also get this information from
server_version, or for a machine-readable
version use server_version_num. Software
developers should use |
current_catalog, current_role, current_schema, current_user, session_user, and user have special syntactic status in SQL: they must be called without trailing parentheses. In PostgreSQL, parentheses can optionally be used with current_schema, but not with the others.
The session_user is normally the user who initiated the current database connection; but superusers can change this setting with SET SESSION AUTHORIZATION. The current_user is the user identifier that is applicable for permission checking. Normally it is equal to the session user, but it can be changed with SET ROLE. It also changes during the execution of functions with the attribute SECURITY DEFINER. In Unix parlance, the session user is the “real user” and the current user is the “effective user”. current_role and user are synonyms for current_user. (The SQL standard draws a distinction between current_role and current_user, but PostgreSQL does not, since it unifies users and roles into a single kind of entity.)
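A minimal sketch of the distinction, assuming a role named auditor exists and the session user is allowed to switch to it:
SELECT session_user, current_user;   -- both report the role that opened the connection
SET ROLE auditor;
SELECT session_user, current_user;   -- session_user is unchanged; current_user is now auditor
RESET ROLE;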
Table 9.66 lists functions that
allow querying object access privileges programmatically.
(See Section 5.7 for more information about
privileges.)
In these functions, the user whose privileges are being inquired about can be specified by name or by OID (pg_authid.oid), or if the name is given as public then the privileges of the PUBLIC pseudo-role are checked. Also, the user argument can be omitted entirely, in which case the current_user is assumed.
The object that is being inquired about can be specified either by name or
by OID, too. When specifying by name, a schema name can be included if
relevant.
The access privilege of interest is specified by a text string, which must evaluate to one of the appropriate privilege keywords for the object's type (e.g., SELECT). Optionally, WITH GRANT OPTION can be added to a privilege type to test whether the privilege is held with grant option. Also, multiple privilege types can be listed separated by commas, in which case the result will be true if any of the listed privileges is held. (Case of the privilege string is not significant, and extra whitespace is allowed between but not within privilege names.)
Some examples:
SELECT has_table_privilege('myschema.mytable', 'select');
SELECT has_table_privilege('joe', 'mytable', 'INSERT, SELECT WITH GRANT OPTION');
Table 9.66. Access Privilege Inquiry Functions
Function Description |
---|
Does user have privilege for any column of table?
This succeeds either if the privilege is held for the whole table, or
if there is a column-level grant of the privilege for at least one
column.
Allowable privilege types are
|
Does user have privilege for the specified table column?
This succeeds either if the privilege is held for the whole table, or
if there is a column-level grant of the privilege for the column.
The column can be specified by name or by attribute number
( |
Does user have privilege for database?
Allowable privilege types are
|
Does user have privilege for foreign-data wrapper?
The only allowable privilege type is |
Does user have privilege for function?
The only allowable privilege type is
When specifying a function by name rather than by OID, the allowed
input is the same as for the SELECT has_function_privilege('joeuser', 'myfunc(int, text)', 'execute');
|
Does user have privilege for language?
The only allowable privilege type is |
Does user have privilege for schema?
Allowable privilege types are
|
Does user have privilege for sequence?
Allowable privilege types are
|
Does user have privilege for foreign server?
The only allowable privilege type is |
Does user have privilege for table?
Allowable privilege types
are |
Does user have privilege for tablespace?
The only allowable privilege type is |
Does user have privilege for data type?
The only allowable privilege type is |
Does user have privilege for role?
Allowable privilege types are
|
Is row-level security active for the specified table in the context of the current user and current environment? |
Table 9.67 shows the operators available for the aclitem type, which is the catalog representation of access privileges. See Section 5.7 for information about how to read access privilege values.
Table 9.67. aclitem Operators
Table 9.68 shows some additional functions to manage the aclitem type.
Table 9.68. aclitem Functions
Function Description |
---|
Constructs an |
Returns the |
Constructs an |
Table 9.69 shows functions that determine whether a certain object is visible in the current schema search path. For example, a table is said to be visible if its containing schema is in the search path and no table of the same name appears earlier in the search path. This is equivalent to the statement that the table can be referenced by name without explicit schema qualification. Thus, to list the names of all visible tables:
SELECT relname FROM pg_class WHERE pg_table_is_visible(oid);
For functions and operators, an object in the search path is said to be visible if there is no object of the same name and argument data type(s) earlier in the path. For operator classes and families, both the name and the associated index access method are considered.
Table 9.69. Schema Visibility Inquiry Functions
All these functions require object OIDs to identify the object to be checked. If you want to test an object by name, it is convenient to use the OID alias types (regclass, regtype, regprocedure, regoperator, regconfig, or regdictionary), for example:
SELECT pg_type_is_visible('myschema.widget'::regtype);
Note that it would not make much sense to test a non-schema-qualified type name in this way — if the name can be recognized at all, it must be visible.
Table 9.70 lists functions that extract information from the system catalogs.
Table 9.70. System Catalog Information Functions
Function Description |
---|
Returns the SQL name for a data type that is identified by its type OID and possibly a type modifier. Pass NULL for the type modifier if no specific modifier is known. |
Returns a set of records describing the foreign key relationships
that exist within the PostgreSQL system
catalogs.
The |
Reconstructs the creating command for a constraint. (This is a decompiled reconstruction, not the original text of the command.) |
Decompiles the internal form of an expression stored in the system catalogs, such as the default value for a column. If the expression might contain Vars, specify the OID of the relation they refer to as the second parameter; if no Vars are expected, passing zero is sufficient. |
Reconstructs the creating command for a function or procedure.
(This is a decompiled reconstruction, not the original text
of the command.)
The result is a complete |
Reconstructs the argument list of a function or procedure, in the form
it would need to appear in within |
Reconstructs the argument list necessary to identify a function or
procedure, in the form it would need to appear in within commands such
as |
Reconstructs the |
Reconstructs the creating command for an index.
(This is a decompiled reconstruction, not the original text
of the command.) If |
Returns a set of records describing the SQL keywords recognized by the
server. The |
Reconstructs the creating command for a rule. (This is a decompiled reconstruction, not the original text of the command.) |
Returns the name of the sequence associated with a column,
or NULL if no sequence is associated with the column.
If the column is an identity column, the associated sequence is the
sequence internally created for that column.
For columns created using one of the serial types
( A typical use is in reading the current value of the sequence for an identity or serial column, for example: SELECT currval(pg_get_serial_sequence('sometable', 'id'));
|
Reconstructs the creating command for an extended statistics object. (This is a decompiled reconstruction, not the original text of the command.) |
Reconstructs the creating command for a trigger. (This is a decompiled reconstruction, not the original text of the command.) |
Returns a role's name given its OID. |
Reconstructs the underlying |
Reconstructs the underlying |
Reconstructs the underlying |
Tests whether an index column has the named property.
Common index column properties are listed in
Table 9.71.
(Note that extension access methods can define additional property
names for their indexes.)
|
Tests whether an index has the named property.
Common index properties are listed in
Table 9.72.
(Note that extension access methods can define additional property
names for their indexes.)
|
Tests whether an index access method has the named property.
Access method properties are listed in
Table 9.73.
|
Returns the set of storage options represented by a value from
|
Returns the set of OIDs of databases that have objects stored in the
specified tablespace. If this function returns any rows, the
tablespace is not empty and cannot be dropped. To identify the specific
objects populating the tablespace, you will need to connect to the
database(s) identified by |
Returns the file system path that this tablespace is located in. |
Returns the OID of the data type of the value that is passed to it.
This can be helpful for troubleshooting or dynamically constructing
SQL queries. The function is declared as
returning For example: SELECT pg_typeof(33); pg_typeof ----------- integer SELECT typlen FROM pg_type WHERE oid = pg_typeof(33); typlen -------- 4
|
Returns the name of the collation of the value that is passed to it.
The value is quoted and schema-qualified if necessary. If no
collation was derived for the argument expression,
then For example: SELECT collation for (description) FROM pg_description LIMIT 1; pg_collation_for ------------------ "default" SELECT collation for ('foo' COLLATE "de_DE"); pg_collation_for ------------------ "de_DE"
|
Translates a textual relation name to its OID. A similar result is
obtained by casting the string to type |
Translates a textual collation name to its OID. A similar result is
obtained by casting the string to type |
Translates a textual schema name to its OID. A similar result is
obtained by casting the string to type |
Translates a textual operator name to its OID. A similar result is
obtained by casting the string to type |
Translates a textual operator name (with parameter types) to its OID. A similar result is
obtained by casting the string to type |
Translates a textual function or procedure name to its OID. A similar result is
obtained by casting the string to type |
Translates a textual function or procedure name (with argument types) to its OID. A similar result is
obtained by casting the string to type |
Translates a textual role name to its OID. A similar result is
obtained by casting the string to type |
Translates a textual type name to its OID. A similar result is
obtained by casting the string to type |
Most of the functions that reconstruct (decompile) database objects have an optional pretty flag, which if true causes the result to be “pretty-printed”. Pretty-printing suppresses unnecessary parentheses and adds whitespace for legibility. The pretty-printed format is more readable, but the default format is more likely to be interpreted the same way by future versions of PostgreSQL; so avoid using pretty-printed output for dump purposes. Passing false for the pretty parameter yields the same result as omitting the parameter.
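For example, a decompilation function such as pg_get_viewdef can be called with or without the pretty flag (the system view used here is only for illustration):
SELECT pg_get_viewdef('pg_stat_activity'::regclass);        -- default format, safest for dumps
SELECT pg_get_viewdef('pg_stat_activity'::regclass, true);  -- pretty-printed for human reading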
Table 9.71. Index Column Properties
Name | Description |
---|---|
asc | Does the column sort in ascending order on a forward scan? |
desc | Does the column sort in descending order on a forward scan? |
nulls_first | Does the column sort with nulls first on a forward scan? |
nulls_last | Does the column sort with nulls last on a forward scan? |
orderable | Does the column possess any defined sort ordering? |
distance_orderable | Can the column be scanned in order by a “distance”
operator, for example ORDER BY col <-> constant ?
|
returnable | Can the column value be returned by an index-only scan? |
search_array | Does the column natively support col = ANY(array)
searches?
|
search_nulls | Does the column support IS NULL and
IS NOT NULL searches?
|
Table 9.72. Index Properties
Name | Description |
---|---|
clusterable | Can the index be used in a CLUSTER command?
|
index_scan | Does the index support plain (non-bitmap) scans? |
bitmap_scan | Does the index support bitmap scans? |
backward_scan | Can the scan direction be changed in mid-scan (to
support FETCH BACKWARD on a cursor without
needing materialization)?
|
Table 9.73. Index Access Method Properties
Name | Description |
---|---|
can_order | Does the access method support ASC ,
DESC and related keywords in
CREATE INDEX ?
|
can_unique | Does the access method support unique indexes? |
can_multi_col | Does the access method support indexes with multiple columns? |
can_exclude | Does the access method support exclusion constraints? |
can_include | Does the access method support the INCLUDE
clause of CREATE INDEX ?
|
Table 9.74 lists functions related to database object identification and addressing.
Table 9.74. Object Information and Addressing Functions
The functions shown in Table 9.75 extract comments previously stored with the COMMENT command. A null value is returned if no comment could be found for the specified parameters.
Table 9.75. Comment Information Functions
The functions shown in Table 9.76 provide server transaction information in an exportable form. The main use of these functions is to determine which transactions were committed between two snapshots.
Table 9.76. Transaction ID and Snapshot Information Functions
Function Description |
---|
Returns the current transaction's ID. It will assign a new one if the current transaction does not have one already (because it has not performed any database updates). |
Returns the current transaction's ID, or |
Reports the commit status of a recent transaction.
The result is one of |
Returns a current snapshot, a data structure showing which transaction IDs are now in-progress. |
Returns the set of in-progress transaction IDs contained in a snapshot. |
Returns the |
Returns the |
Is the given transaction ID visible according to this snapshot (that is, was it completed before the snapshot was taken)? Note that this function will not give the correct answer for a subtransaction ID. |
The internal transaction ID type xid
is 32 bits wide and
wraps around every 4 billion transactions. However,
the functions shown in Table 9.76 use a
64-bit type xid8
that does not wrap around during the life
of an installation, and can be converted to xid
by casting if
required. The data type pg_snapshot
stores information about
transaction ID visibility at a particular moment in time. Its components
are described in Table 9.77.
pg_snapshot's textual representation is xmin:xmax:xip_list. For example 10:20:10,14,15 means xmin=10, xmax=20, xip_list=10, 14, 15.
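A quick sketch of obtaining such a value (the numbers will of course differ on any given system):
SELECT pg_current_snapshot();   -- returns a value such as 10:20:10,14,15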
Table 9.77. Snapshot Components
Name | Description |
---|---|
xmin | Lowest transaction ID that was still active. All transaction IDs less than xmin are either committed and visible, or rolled back and dead. |
xmax | One past the highest completed transaction ID. All transaction IDs greater than or equal to xmax had not yet completed as of the time of the snapshot, and thus are invisible. |
xip_list | Transactions in progress at the time of the snapshot. A transaction ID that is xmin <= X < xmax and not in this list was already completed at the time of the snapshot, and thus is either visible or dead according to its commit status. This list does not include the transaction IDs of subtransactions. |
In releases of PostgreSQL before 13 there was no xid8 type, so variants of these functions were provided that used bigint to represent a 64-bit XID, with a correspondingly distinct snapshot data type txid_snapshot. These older functions have txid in their names. They are still supported for backward compatibility, but may be removed from a future release. See Table 9.78.
Table 9.78. Deprecated Transaction ID and Snapshot Information Functions
The functions shown in Table 9.79 provide information about when past transactions were committed. They only provide useful data when the track_commit_timestamp configuration option is enabled, and only for transactions that were committed after it was enabled.
Table 9.79. Committed Transaction Information Functions
The functions shown in Table 9.80
print information initialized during initdb
, such
as the catalog version. They also show information about write-ahead
logging and checkpoint processing. This information is cluster-wide,
not specific to any one database. These functions provide most of the same
information, from the same source, as the
pg_controldata application.
Table 9.80. Control Data Functions
Function Description |
---|
Returns information about current checkpoint state, as shown in Table 9.81. |
Returns information about current control file state, as shown in Table 9.82. |
Returns information about cluster initialization state, as shown in Table 9.83. |
Returns information about recovery state, as shown in Table 9.84. |
Table 9.81. pg_control_checkpoint
Output Columns
Column Name | Data Type |
---|---|
checkpoint_lsn | pg_lsn |
redo_lsn | pg_lsn |
redo_wal_file | text |
timeline_id | integer |
prev_timeline_id | integer |
full_page_writes | boolean |
next_xid | text |
next_oid | oid |
next_multixact_id | xid |
next_multi_offset | xid |
oldest_xid | xid |
oldest_xid_dbid | oid |
oldest_active_xid | xid |
oldest_multi_xid | xid |
oldest_multi_dbid | oid |
oldest_commit_ts_xid | xid |
newest_commit_ts_xid | xid |
checkpoint_time | timestamp with time zone |
Table 9.82. pg_control_system
Output Columns
Column Name | Data Type |
---|---|
pg_control_version | integer |
catalog_version_no | integer |
system_identifier | bigint |
pg_control_last_modified | timestamp with time zone |
Table 9.83. pg_control_init
Output Columns
Column Name | Data Type |
---|---|
max_data_alignment | integer |
database_block_size | integer |
blocks_per_segment | integer |
wal_block_size | integer |
bytes_per_wal_segment | integer |
max_identifier_length | integer |
max_index_columns | integer |
max_toast_chunk_size | integer |
large_object_chunk_size | integer |
float8_pass_by_value | boolean |
data_page_checksum_version | integer |
Table 9.84. pg_control_recovery
Output Columns
Column Name | Data Type |
---|---|
min_recovery_end_lsn | pg_lsn |
min_recovery_end_timeline | integer |
backup_start_lsn | pg_lsn |
backup_end_lsn | pg_lsn |
end_of_backup_record_required | boolean |
The functions described in this section are used to control and monitor a PostgreSQL installation.
Table 9.85 shows the functions available to query and alter run-time configuration parameters.
Table 9.85. Configuration Settings Functions
Function Description Example(s) |
---|
Returns the current value of the
setting
|
Sets the parameter
|
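A short sketch using the configuration functions current_setting and set_config on an ordinary session parameter:
SELECT current_setting('work_mem');             -- read the current value
SELECT set_config('work_mem', '64MB', false);   -- change it; false means the new value lasts for the session, not just the transaction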
The functions shown in Table 9.86 send control signals to other server processes. Use of these functions is restricted to superusers by default but access may be granted to others using GRANT, with noted exceptions. Each of these functions returns true if the signal was successfully sent and false if sending the signal failed.
Table 9.86. Server Signaling Functions
Function Description |
---|
Cancels the current query of the session whose backend process has the
specified process ID. This is also allowed if the
calling role is a member of the role whose backend is being canceled or
the calling role has been granted |
Requests to log the memory contexts of the backend with the
specified process ID. These memory contexts will be logged at
|
Causes all processes of the PostgreSQL
server to reload their configuration files. (This is initiated by
sending a SIGHUP signal to the postmaster
process, which in turn sends SIGHUP to each
of its children.) You can use the
|
Signals the log-file manager to switch to a new output file immediately. This works only when the built-in log collector is running, since otherwise there is no log-file manager subprocess. |
Terminates the session whose backend process has the
specified process ID. This is also allowed if the calling role
is a member of the role whose backend is being terminated or the
calling role has been granted
If |
pg_cancel_backend and pg_terminate_backend send signals (SIGINT or SIGTERM respectively) to backend processes identified by process ID. The process ID of an active backend can be found from the pid column of the pg_stat_activity view, or by listing the postgres processes on the server (using ps on Unix or the Task Manager on Windows). The role of an active backend can be found from the usename column of the pg_stat_activity view.
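For example, the following sketch cancels the running queries of all other backends connected to a hypothetical database mydb (the database name is only an illustration):
SELECT pg_cancel_backend(pid)
  FROM pg_stat_activity
 WHERE datname = 'mydb'
   AND pid <> pg_backend_pid();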
pg_log_backend_memory_contexts
can be used
to log the memory contexts of a backend process. For example:
postgres=# SELECT pg_log_backend_memory_contexts(pg_backend_pid());
 pg_log_backend_memory_contexts
--------------------------------
 t
(1 row)
One message for each memory context will be logged. For example:
LOG:  logging memory contexts of PID 10377
STATEMENT:  SELECT pg_log_backend_memory_contexts(pg_backend_pid());
LOG:  level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 chunks); 66368 used
LOG:  level: 1; pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used
LOG:  level: 1; TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used
LOG:  level: 1; RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 chunks); 1312 used
LOG:  level: 1; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 11232 used
LOG:  level: 1; Operator class cache: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
LOG:  level: 1; smgr relation table: 16384 total in 2 blocks; 4544 free (3 chunks); 11840 used
LOG:  level: 1; TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used
...
LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 264 used
LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560 used
If there are more than 100 child contexts under the same parent, the first 100 child contexts are logged, along with a summary of the remaining contexts. Note that frequent calls to this function could incur significant overhead, because it may generate a large number of log messages.
The functions shown in Table 9.87 assist in making on-line backups. These functions cannot be executed during recovery (except non-exclusive pg_start_backup, non-exclusive pg_stop_backup, pg_is_in_backup, pg_backup_start_time and pg_wal_lsn_diff). For details about proper usage of these functions, see Section 26.3.
Table 9.87. Backup Control Functions
Function Description |
---|
Creates a named marker record in the write-ahead log that can later be used as a recovery target, and returns the corresponding write-ahead log location. The given name can then be used with recovery_target_name to specify the point up to which recovery will proceed. Avoid creating multiple restore points with the same name, since recovery will stop at the first one whose name matches the recovery target. This function is restricted to superusers by default, but other users can be granted EXECUTE to run the function. |
Returns the current write-ahead log flush location (see notes below). |
Returns the current write-ahead log insert location (see notes below). |
Returns the current write-ahead log write location (see notes below). |
Prepares the server to begin an on-line backup. The only required
parameter is an arbitrary user-defined label for the backup.
(Typically this would be the name under which the backup dump file
will be stored.)
If the optional second parameter is given as
When used in exclusive mode, this function writes a backup label file
( This function is restricted to superusers by default, but other users can be granted EXECUTE to run the function. |
Finishes performing an exclusive or non-exclusive on-line backup.
The
There is an optional second parameter of type
When executed on a primary, this function also creates a backup
history file in the write-ahead log archive area. The history file
includes the label given to
The result of the function is a single record.
The This function is restricted to superusers by default, but other users can be granted EXECUTE to run the function. |
Finishes performing an exclusive on-line backup. This simplified
version is equivalent to This function is restricted to superusers by default, but other users can be granted EXECUTE to run the function. |
Returns true if an on-line exclusive backup is in progress. |
Returns the start time of the current on-line exclusive backup if one
is in progress, otherwise |
Forces the server to switch to a new write-ahead log file, which
allows the current file to be archived (assuming you are using
continuous archiving). The result is the ending write-ahead log
location plus 1 within the just-completed write-ahead log file. If
there has been no write-ahead log activity since the last write-ahead
log switch, This function is restricted to superusers by default, but other users can be granted EXECUTE to run the function. |
Converts a write-ahead log location to the name of the WAL file holding that location. |
Converts a write-ahead log location to a WAL file name and byte offset within that file. |
Calculates the difference in bytes ( |
pg_current_wal_lsn displays the current write-ahead log write location in the same format used by the above functions. Similarly, pg_current_wal_insert_lsn displays the current write-ahead log insertion location and pg_current_wal_flush_lsn displays the current write-ahead log flush location. The insertion location is the “logical” end of the write-ahead log at any instant, while the write location is the end of what has actually been written out from the server's internal buffers, and the flush location is the last location known to be written to durable storage. The write location is the end of what can be examined from outside the server, and is usually what you want if you are interested in archiving partially-complete write-ahead log files. The insertion and flush locations are made available primarily for server debugging purposes. These are all read-only operations and do not require superuser permissions.
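A minimal sketch querying all three locations at once (the returned pg_lsn values are system-specific):
SELECT pg_current_wal_lsn(),
       pg_current_wal_insert_lsn(),
       pg_current_wal_flush_lsn();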
You can use pg_walfile_name_offset to extract the corresponding write-ahead log file name and byte offset from a pg_lsn value. For example:
postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
        file_name         | file_offset
--------------------------+-------------
 00000001000000000000000D |     4039624
(1 row)
Similarly, pg_walfile_name
extracts just the write-ahead log file name.
When the given write-ahead log location is exactly at a write-ahead log file boundary, both
these functions return the name of the preceding write-ahead log file.
This is usually the desired behavior for managing write-ahead log archiving
behavior, since the preceding file is the last one that currently
needs to be archived.
The functions shown in Table 9.88 provide information about the current status of a standby server. These functions may be executed both during recovery and in normal running.
Table 9.88. Recovery Information Functions
The functions shown in Table 9.89 control the progress of recovery. These functions may be executed only during recovery.
Table 9.89. Recovery Control Functions
pg_wal_replay_pause and pg_wal_replay_resume cannot be executed while a promotion is ongoing. If a promotion is triggered while recovery is paused, the paused state ends and promotion continues.
If streaming replication is disabled, the paused state may continue indefinitely without a problem. If streaming replication is in progress then WAL records will continue to be received, which will eventually fill available disk space, depending upon the duration of the pause, the rate of WAL generation and available disk space.
PostgreSQL allows database sessions to synchronize their
snapshots. A snapshot determines which data is visible to the
transaction that is using the snapshot. Synchronized snapshots are
necessary when two or more sessions need to see identical content in the
database. If two sessions just start their transactions independently,
there is always a possibility that some third transaction commits
between the executions of the two START TRANSACTION
commands,
so that one session sees the effects of that transaction and the other
does not.
To solve this problem, PostgreSQL allows a transaction to export the snapshot it is using. As long as the exporting transaction remains open, other transactions can import its snapshot, and thereby be guaranteed that they see exactly the same view of the database that the first transaction sees. But note that any database changes made by any one of these transactions remain invisible to the other transactions, as is usual for changes made by uncommitted transactions. So the transactions are synchronized with respect to pre-existing data, but act normally for changes they make themselves.
Snapshots are exported with the pg_export_snapshot
function,
shown in Table 9.90, and
imported with the SET TRANSACTION command.
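As a sketch of the workflow, with the exported snapshot identifier shown as a placeholder value:
-- session 1
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- suppose this returns '00000003-0000001B-1'

-- session 2, while session 1's transaction is still open
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';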
Table 9.90. Snapshot Synchronization Functions
Function Description |
---|
Saves the transaction's current snapshot and returns
a
A transaction can export more than one snapshot, if needed. Note that
doing so is only useful in |
The functions shown in Table 9.91 are for controlling and interacting with replication features. See Section 27.2.5, Section 27.2.6, and Chapter 50 for information about the underlying features. Use of functions for replication origin is only allowed to the superuser by default, but may be allowed to other users by using the GRANT command. Use of functions for replication slots is restricted to superusers and users having REPLICATION privilege.
Many of these functions have equivalent commands in the replication protocol; see Section 53.4.
The functions described in Section 9.27.3, Section 9.27.4, and Section 9.27.5 are also relevant for replication.
Table 9.91. Replication Management Functions
Function Description |
---|
Creates a new physical replication slot named
|
Drops the physical or logical replication slot
named |
Creates a new logical (decoding) replication slot named
|
Copies an existing physical replication slot named |
Copies an existing logical replication slot
named |
Returns changes in the slot |
Behaves just like
the |
Behaves just like
the |
Behaves just like
the |
Advances the current confirmed position of a replication slot named
|
Creates a replication origin with the given external name, and returns the internal ID assigned to it. |
Deletes a previously-created replication origin, including any associated replay progress. |
Looks up a replication origin by name and returns the internal ID. If
no such replication origin is found, |
Marks the current session as replaying from the given
origin, allowing replay progress to be tracked.
Can only be used if no origin is currently selected.
Use |
Cancels the effects
of |
Returns true if a replication origin has been selected in the current session. |
Returns the replay location for the replication origin selected in
the current session. The parameter |
Marks the current transaction as replaying a transaction that has
committed at the given LSN and timestamp. Can
only be called when a replication origin has been selected
using |
Cancels the effects of
|
Sets replication progress for the given node to the given location. This is primarily useful for setting up the initial location, or setting a new location after configuration changes and similar. Be aware that careless use of this function can lead to inconsistently replicated data. |
Returns the replay location for the given replication origin. The
parameter |
Emits a logical decoding message. This can be used to pass generic
messages to logical decoding plugins through
WAL. The |
The functions shown in Table 9.92 calculate the disk space usage of database objects, or assist in presentation or understanding of usage results. bigint results are measured in bytes. If an OID that does not represent an existing object is passed to one of these functions, NULL is returned.
Table 9.92. Database Object Size Functions
Function Description |
---|
Shows the number of bytes used to store any individual data value. If applied directly to a table column value, this reflects any compression that was done. |
Shows the compression algorithm that was used to compress
an individual variable-length value. Returns |
Computes the total disk space used by the database with the specified
name or OID. To use this function, you must
have |
Computes the total disk space used by indexes attached to the specified table. |
Computes the disk space used by one “fork” of the
specified relation. (Note that for most purposes it is more
convenient to use the higher-level
functions
|
Converts a size in human-readable format (as returned
by |
Converts a size in bytes into a more easily human-readable format with size units (bytes, kB, MB, GB or TB as appropriate). Note that the units are powers of 2 rather than powers of 10, so 1kB is 1024 bytes, 1MB is 1024² = 1048576 bytes, and so on. |
Computes the disk space used by the specified table, excluding indexes (but including its TOAST table if any, free space map, and visibility map). |
Computes the total disk space used in the tablespace with the
specified name or OID. To use this function, you must
have |
Computes the total disk space used by the specified table, including
all indexes and TOAST data. The result is
equivalent to |
The functions above that operate on tables or indexes accept a regclass argument, which is simply the OID of the table or index in the pg_class system catalog. You do not have to look up the OID by hand, however, since the regclass data type's input converter will do the work for you. See Section 8.19 for details.
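For example, assuming a table named mytable exists, its total on-disk footprint could be reported with:
SELECT pg_size_pretty(pg_total_relation_size('mytable'));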
The functions shown in Table 9.93 assist in identifying the specific disk files associated with database objects.
Table 9.93. Database Object Location Functions
Function Description |
---|
Returns the “filenode” number currently assigned to the
specified relation. The filenode is the base component of the file
name(s) used for the relation (see
Section 70.1 for more information).
For most relations the result is the same as
|
Returns the entire file path name (relative to the database cluster's
data directory, |
Returns a relation's OID given the tablespace OID and filenode it is
stored under. This is essentially the inverse mapping of
|
Table 9.94 lists functions used to manage collations.
Table 9.94. Collation Management Functions
Function Description |
---|
Returns the actual version of the collation object as it is currently
installed in the operating system. If this is different from the
value in
|
Adds collations to the system
catalog |
Table 9.95 lists functions that provide information about the structure of partitioned tables.
Table 9.95. Partitioning Information Functions
For example, to check the total size of the data contained in a partitioned table measurement, one could use the following query:
SELECT pg_size_pretty(sum(pg_relation_size(relid))) AS total_size FROM pg_partition_tree('measurement');
Table 9.96 shows the functions available for index maintenance tasks. (Note that these maintenance tasks are normally done automatically by autovacuum; use of these functions is only required in special cases.) These functions cannot be executed during recovery. Use of these functions is restricted to superusers and the owner of the given index.
Table 9.96. Index Maintenance Functions
Function Description |
---|
Scans the specified BRIN index to find page ranges in the base table that are not currently summarized by the index; for any such range it creates a new summary index tuple by scanning those table pages. Returns the number of new page range summaries that were inserted into the index. |
Summarizes the page range covering the given block, if not already
summarized. This is
like |
Removes the BRIN index tuple that summarizes the page range covering the given table block, if there is one. |
Cleans up the “pending” list of the specified GIN index
by moving entries in it, in bulk, to the main GIN data structure.
Returns the number of pages removed from the pending list.
If the argument is a GIN index built with
the |
The functions shown in Table 9.97 provide native access to
files on the machine hosting the server. Only files within the
database cluster directory and the log_directory
can be
accessed, unless the user is a superuser or is granted the role
pg_read_server_files
. Use a relative path for files in
the cluster directory, and a path matching the log_directory
configuration setting for log files.
Note that granting users the EXECUTE privilege on
pg_read_file()
, or related functions, allows them the
ability to read any file on the server that the database server process can
read; these functions bypass all in-database privilege checks. This means
that, for example, a user with such access is able to read the contents of
the pg_authid
table where authentication
information is stored, as well as read any table data in the database.
Therefore, granting access to these functions should be carefully
considered.
Some of these functions take an optional missing_ok parameter, which specifies the behavior when the file or directory does not exist. If true, the function returns NULL or an empty result set, as appropriate. If false, an error is raised. The default is false.
Table 9.97. Generic File Access Functions
The functions shown in Table 9.98 manage advisory locks. For details about proper use of these functions, see Section 13.3.5.
All these functions are intended to be used to lock application-defined
resources, which can be identified either by a single 64-bit key value or
two 32-bit key values (note that these two key spaces do not overlap).
If another session already holds a conflicting lock on the same resource
identifier, the functions will either wait until the resource becomes
available, or return a false
result, as appropriate for
the function.
Locks can be either shared or exclusive: a shared lock does not conflict
with other shared locks on the same resource, only with exclusive locks.
Locks can be taken at session level (so that they are held until released
or the session ends) or at transaction level (so that they are held until
the current transaction ends; there is no provision for manual release).
Multiple session-level lock requests stack, so that if the same resource
identifier is locked three times there must then be three unlock requests
to release the resource in advance of session end.
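A small sketch using a single 64-bit application-defined key (the key value 42 is arbitrary):
-- session 1
SELECT pg_advisory_lock(42);       -- acquires a session-level exclusive lock on key 42

-- session 2
SELECT pg_try_advisory_lock(42);   -- returns false while session 1 holds the lock

-- session 1
SELECT pg_advisory_unlock(42);     -- releases the lock; session 2 could now acquire it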
Table 9.98. Advisory Lock Functions
While many uses of triggers involve user-written trigger functions, PostgreSQL provides a few built-in trigger functions that can be used directly in user-defined triggers. These are summarized in Table 9.99. (Additional built-in trigger functions exist, which implement foreign key constraints and deferred index constraints. Those are not documented here since users need not use them directly.)
For more information about creating triggers, see CREATE TRIGGER.
Table 9.99. Built-In Trigger Functions
Function Description Example Usage |
---|
Suppresses do-nothing update operations. See below for details.
|
Automatically updates a
|
Automatically updates a
|
The suppress_redundant_updates_trigger
function,
when applied as a row-level BEFORE UPDATE
trigger,
will prevent any update that does not actually change the data in the
row from taking place. This overrides the normal behavior which always
performs a physical row update
regardless of whether or not the data has changed. (This normal behavior
makes updates run faster, since no checking is required, and is also
useful in certain cases.)
Ideally, you should avoid running updates that don't actually
change the data in the record. Redundant updates can cost considerable
unnecessary time, especially if there are lots of indexes to alter,
and space in dead rows that will eventually have to be vacuumed.
However, detecting such situations in client code is not
always easy, or even possible, and writing expressions to detect
them can be error-prone. An alternative is to use
suppress_redundant_updates_trigger
, which will skip
updates that don't change the data. You should use this with care,
however. The trigger takes a small but non-trivial time for each record,
so if most of the records affected by updates do actually change,
use of this trigger will make updates run slower on average.
The suppress_redundant_updates_trigger
function can be
added to a table like this:
CREATE TRIGGER z_min_update
BEFORE UPDATE ON tablename
FOR EACH ROW EXECUTE FUNCTION suppress_redundant_updates_trigger();
In most cases, you need to fire this trigger last for each row, so that it does not override other triggers that might wish to alter the row. Bearing in mind that triggers fire in name order, you would therefore choose a trigger name that comes after the name of any other trigger you might have on the table. (Hence the “z” prefix in the example.)
PostgreSQL provides these helper functions to retrieve information from event triggers.
For more information about event triggers, see Chapter 40.
pg_event_trigger_ddl_commands () → setof record
pg_event_trigger_ddl_commands
returns a list of
DDL commands executed by each user action,
when invoked in a function attached to a
ddl_command_end
event trigger. If called in any other
context, an error is raised.
pg_event_trigger_ddl_commands
returns one row for each
base command executed; some commands that are a single SQL sentence
may return more than one row. This function returns the following
columns:
Name | Type | Description |
---|---|---|
classid | oid | OID of catalog the object belongs in |
objid | oid | OID of the object itself |
objsubid | integer | Sub-object ID (e.g., attribute number for a column) |
command_tag | text | Command tag |
object_type | text | Type of the object |
schema_name | text |
Name of the schema the object belongs in, if any; otherwise NULL .
No quoting is applied.
|
object_identity | text | Text rendering of the object identity, schema-qualified. Each identifier included in the identity is quoted if necessary. |
in_extension | boolean | True if the command is part of an extension script |
command | pg_ddl_command | A complete representation of the command, in internal format. This cannot be output directly, but it can be passed to other functions to obtain different pieces of information about the command. |
pg_event_trigger_dropped_objects () → setof record
pg_event_trigger_dropped_objects
returns a list of all objects
dropped by the command in whose sql_drop
event it is called.
If called in any other context, an error is raised.
This function returns the following columns:
Name | Type | Description |
---|---|---|
classid | oid | OID of catalog the object belonged in |
objid | oid | OID of the object itself |
objsubid | integer | Sub-object ID (e.g., attribute number for a column) |
original | boolean | True if this was one of the root object(s) of the deletion |
normal | boolean | True if there was a normal dependency relationship in the dependency graph leading to this object |
is_temporary | boolean | True if this was a temporary object |
object_type | text | Type of the object |
schema_name | text |
Name of the schema the object belonged in, if any; otherwise NULL .
No quoting is applied.
|
object_name | text |
Name of the object, if the combination of schema and name can be
used as a unique identifier for the object; otherwise NULL .
No quoting is applied, and name is never schema-qualified.
|
object_identity | text | Text rendering of the object identity, schema-qualified. Each identifier included in the identity is quoted if necessary. |
address_names | text[] |
An array that, together with object_type and
address_args , can be used by
the pg_get_object_address function to
recreate the object address in a remote server containing an
identically named object of the same kind.
|
address_args | text[] |
Complement for address_names
|
The pg_event_trigger_dropped_objects
function can be used
in an event trigger like this:
CREATE FUNCTION test_event_trigger_for_drops()
        RETURNS event_trigger LANGUAGE plpgsql AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_dropped_objects()
    LOOP
        RAISE NOTICE '% dropped object: % %.% %',
                     tg_tag,
                     obj.object_type,
                     obj.schema_name,
                     obj.object_name,
                     obj.object_identity;
    END LOOP;
END;
$$;
CREATE EVENT TRIGGER test_event_trigger_for_drops
   ON sql_drop
   EXECUTE FUNCTION test_event_trigger_for_drops();
The functions shown in
Table 9.100
provide information about a table for which a
table_rewrite
event has just been called.
If called in any other context, an error is raised.
Table 9.100. Table Rewrite Information Functions
These functions can be used in an event trigger like this:
CREATE FUNCTION test_event_trigger_table_rewrite_oid()
 RETURNS event_trigger
 LANGUAGE plpgsql AS
$$
BEGIN
  RAISE NOTICE 'rewriting table % for reason %',
                pg_event_trigger_table_rewrite_oid()::regclass,
                pg_event_trigger_table_rewrite_reason();
END;
$$;

CREATE EVENT TRIGGER test_table_rewrite_oid
                  ON table_rewrite
   EXECUTE FUNCTION test_event_trigger_table_rewrite_oid();
PostgreSQL provides a function to inspect complex
statistics defined using the CREATE STATISTICS
command.
pg_mcv_list_items ( pg_mcv_list ) → setof record
pg_mcv_list_items
returns a set of records describing
all items stored in a multi-column MCV list. It
returns the following columns:
Name | Type | Description |
---|---|---|
index | integer | index of the item in the MCV list |
values | text[] | values stored in the MCV item |
nulls | boolean[] | flags identifying NULL values |
frequency | double precision | frequency of this MCV item |
base_frequency | double precision | base frequency of this MCV item |
The pg_mcv_list_items
function can be used like this:
SELECT m.*
  FROM pg_statistic_ext JOIN pg_statistic_ext_data ON (oid = stxoid),
       pg_mcv_list_items(stxdmcv) m
 WHERE stxname = 'stts';
Values of the pg_mcv_list type can be obtained only from the pg_statistic_ext_data.stxdmcv column.
[8]
A result containing more than one element node at the top level, or
non-whitespace text outside of an element, is an example of content form.
An XPath result can be of neither form, for example if it returns an
attribute node selected from the element that contains it. Such a result
will be put into content form with each such disallowed node replaced by
its string value, as defined for the XPath 1.0
string
function.
Table of Contents
SQL statements can, intentionally or not, require the mixing of different data types in the same expression. PostgreSQL has extensive facilities for evaluating mixed-type expressions.
In many cases a user does not need to understand the details of the type conversion mechanism. However, implicit conversions done by PostgreSQL can affect the results of a query. When necessary, these results can be tailored by using explicit type conversion.
This chapter introduces the PostgreSQL type conversion mechanisms and conventions. Refer to the relevant sections in Chapter 8 and Chapter 9 for more information on specific data types and allowed functions and operators.
SQL is a strongly typed language. That is, every data item has an associated data type which determines its behavior and allowed usage. PostgreSQL has an extensible type system that is more general and flexible than other SQL implementations. Hence, most type conversion behavior in PostgreSQL is governed by general rules rather than by ad hoc heuristics. This allows the use of mixed-type expressions even with user-defined types.
The PostgreSQL scanner/parser divides lexical elements into five fundamental categories: integers, non-integer numbers, strings, identifiers, and key words. Constants of most non-numeric types are first classified as strings. The SQL language definition allows specifying type names with strings, and this mechanism can be used in PostgreSQL to start the parser down the correct path. For example, the query:
SELECT text 'Origin' AS "label", point '(0,0)' AS "value"; label | value --------+------- Origin | (0,0) (1 row)
has two literal constants, of type text and point. If a type is not specified for a string literal, then the placeholder type unknown is assigned initially, to be resolved in later stages as described below.
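A hedged sketch of this initial assignment, using pg_typeof (the exact reporting may vary across PostgreSQL versions):
SELECT pg_typeof('Origin');        -- reported as unknown: nothing in the context forces a type onto the literal
SELECT pg_typeof(point '(0,0)');   -- point: the leading type name resolves the literal immediately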
There are four fundamental SQL constructs requiring distinct type conversion rules in the PostgreSQL parser:
Much of the PostgreSQL type system is built around a rich set of functions. Functions can have one or more arguments. Since PostgreSQL permits function overloading, the function name alone does not uniquely identify the function to be called; the parser must select the right function based on the data types of the supplied arguments.
PostgreSQL allows expressions with prefix (one-argument) operators, as well as infix (two-argument) operators. Like functions, operators can be overloaded, so the same problem of selecting the right operator exists.
SQL INSERT
and UPDATE
statements place the results of
expressions into a table. The expressions in the statement must be matched up
with, and perhaps converted to, the types of the target columns.
UNION
, CASE
, and related constructs
Since all query results from a unionized SELECT
statement
must appear in a single set of columns, the types of the results of each
SELECT
clause must be matched up and converted to a uniform set.
Similarly, the result expressions of a CASE
construct must be
converted to a common type so that the CASE
expression as a whole
has a known output type. Some other constructs, such
as ARRAY[]
and the GREATEST
and LEAST
functions, likewise require determination of a
common type for several subexpressions.
The system catalogs store information about which conversions, or casts, exist between which data types, and how to perform those conversions. Additional casts can be added by the user with the CREATE CAST command. (This is usually done in conjunction with defining new data types. The set of casts between built-in types has been carefully crafted and is best not altered.)
An additional heuristic provided by the parser allows improved determination
of the proper casting behavior among groups of types that have implicit casts.
Data types are divided into several basic type
categories, including boolean
, numeric
,
string
, bitstring
, datetime
,
timespan
, geometric
, network
, and
user-defined. (For a list see Table 52.63;
but note it is also possible to create custom type categories.) Within each
category there can be one or more preferred types, which
are preferred when there is a choice of possible types. With careful selection
of preferred types and available implicit casts, it is possible to ensure that
ambiguous expressions (those with multiple candidate parsing solutions) can be
resolved in a useful way.
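The category codes and preferred-type flags can be inspected in the system catalogs; for example, a minimal sketch listing the numeric category:

-- list types in the numeric category ('N') and whether each is marked preferred
SELECT typname, typcategory, typispreferred
FROM pg_type
WHERE typcategory = 'N'
ORDER BY typname;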
All type conversion rules are designed with several principles in mind:
Implicit conversions should never have surprising or unpredictable outcomes.
There should be no extra overhead in the parser or executor if a query does not need implicit type conversion. That is, if a query is well-formed and the types already match, then the query should execute without spending extra time in the parser and without introducing unnecessary implicit conversion calls in the query.
Additionally, if a query usually requires an implicit conversion for a function, and if then the user defines a new function with the correct argument types, the parser should use this new function and no longer do implicit conversion to use the old function.
The specific operator that is referenced by an operator expression is determined using the following procedure. Note that this procedure is indirectly affected by the precedence of the operators involved, since that will determine which sub-expressions are taken to be the inputs of which operators. See Section 4.1.6 for more information.
Operator Type Resolution
Select the operators to be considered from the
pg_operator
system catalog. If a non-schema-qualified
operator name was used (the usual case), the operators
considered are those with the matching name and argument count that are
visible in the current search path (see Section 5.9.3).
If a qualified operator name was given, only operators in the specified
schema are considered.
If the search path finds multiple operators with identical argument types, only the one appearing earliest in the path is considered. Operators with different argument types are considered on an equal footing regardless of search path position.
Check for an operator accepting exactly the input argument types. If one exists (there can be only one exact match in the set of operators considered), use it. Lack of an exact match creates a security hazard when calling, via qualified name [9] (not typical), any operator found in a schema that permits untrusted users to create objects. In such situations, cast arguments to force an exact match.
If one argument of a binary operator invocation is of the unknown
type,
then assume it is the same type as the other argument for this check.
Invocations involving two unknown
inputs, or a prefix operator
with an unknown
input, will never find a match at this step.
If one argument of a binary operator invocation is of the unknown
type and the other is of a domain type, next check to see if there is an
operator accepting exactly the domain's base type on both sides; if so, use it.
Look for the best match.
Discard candidate operators for which the input types do not match
and cannot be converted (using an implicit conversion) to match.
unknown
literals are
assumed to be convertible to anything for this purpose. If only one
candidate remains, use it; else continue to the next step.
If any input argument is of a domain type, treat it as being of the domain's base type for all subsequent steps. This ensures that domains act like their base types for purposes of ambiguous-operator resolution.
Run through all candidates and keep those with the most exact matches on input types. Keep all candidates if none have exact matches. If only one candidate remains, use it; else continue to the next step.
Run through all candidates and keep those that accept preferred types (of the input data type's type category) at the most positions where type conversion will be required. Keep all candidates if none accept preferred types. If only one candidate remains, use it; else continue to the next step.
If any input arguments are unknown
, check the type
categories accepted at those argument positions by the remaining
candidates. At each position, select the string
category
if any
candidate accepts that category. (This bias towards string is appropriate
since an unknown-type literal looks like a string.) Otherwise, if
all the remaining candidates accept the same type category, select that
category; otherwise fail because the correct choice cannot be deduced
without more clues. Now discard
candidates that do not accept the selected type category. Furthermore,
if any candidate accepts a preferred type in that category,
discard candidates that accept non-preferred types for that argument.
Keep all candidates if none survive these tests.
If only one candidate remains, use it; else continue to the next step.
If there are both unknown
and known-type arguments, and all
the known-type arguments have the same type, assume that the
unknown
arguments are also of that type, and check which
candidates can accept that type at the unknown
-argument
positions. If exactly one candidate passes this test, use it.
Otherwise, fail.
Some examples follow.
Example 10.1. Square Root Operator Type Resolution
There is only one square root operator (prefix |/
)
defined in the standard catalog, and it takes an argument of type
double precision
.
The scanner assigns an initial type of integer
to the argument
in this query expression:
SELECT |/ 40 AS "square root of 40";

 square root of 40
-------------------
 6.324555320336759
(1 row)
So the parser does a type conversion on the operand and the query is equivalent to:
SELECT |/ CAST(40 AS double precision) AS "square root of 40";
Example 10.2. String Concatenation Operator Type Resolution
A string-like syntax is used for working with string types and for working with complex extension types. Strings with unspecified type are matched with likely operator candidates.
An example with one unspecified argument:
SELECT text 'abc' || 'def' AS "text and unknown";

 text and unknown
------------------
 abcdef
(1 row)
In this case the parser looks to see if there is an operator taking text
for both arguments. Since there is, it assumes that the second argument should
be interpreted as type text
.
Here is a concatenation of two values of unspecified types:
SELECT 'abc' || 'def' AS "unspecified";

 unspecified
-------------
 abcdef
(1 row)
In this case there is no initial hint for which type to use, since no types
are specified in the query. So, the parser looks for all candidate operators
and finds that there are candidates accepting both string-category and
bit-string-category inputs. Since string category is preferred when available,
that category is selected, and then the
preferred type for strings, text
, is used as the specific
type to resolve the unknown-type literals as.
Example 10.3. Absolute-Value and Negation Operator Type Resolution
The PostgreSQL operator catalog has several
entries for the prefix operator @
, all of which implement
absolute-value operations for various numeric data types. One of these
entries is for type float8
, which is the preferred type in
the numeric category. Therefore, PostgreSQL
will use that entry when faced with an unknown
input:
SELECT @ '-4.5' AS "abs";

 abs
-----
 4.5
(1 row)
Here the system has implicitly resolved the unknown-type literal as type
float8
before applying the chosen operator. We can verify that
float8
and not some other type was used:
SELECT @ '-4.5e500' AS "abs";

ERROR:  "-4.5e500" is out of range for type double precision
On the other hand, the prefix operator ~
(bitwise negation)
is defined only for integer data types, not for float8
. So, if we
try a similar case with ~
, we get:
SELECT ~ '20' AS "negation";

ERROR:  operator is not unique: ~ "unknown"
HINT:  Could not choose a best candidate operator. You might need to add explicit type casts.
This happens because the system cannot decide which of the several
possible ~
operators should be preferred. We can help
it out with an explicit cast:
SELECT ~ CAST('20' AS int8) AS "negation";

 negation
----------
      -21
(1 row)
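The same effect can be obtained with PostgreSQL's shorthand cast notation; given the usual operator precedence, this is equivalent to the CAST form above:

SELECT ~ '20'::int8 AS "negation";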
Example 10.4. Array Inclusion Operator Type Resolution
Here is another example of resolving an operator with one known and one unknown input:
SELECT array[1,2] <@ '{1,2,3}' as "is subset";

 is subset
-----------
 t
(1 row)
The PostgreSQL operator catalog has several
entries for the infix operator <@
, but the only two that
could possibly accept an integer array on the left-hand side are
array inclusion (anyarray
<@
anyarray
)
and range inclusion (anyelement
<@
anyrange
).
Since none of these polymorphic pseudo-types (see Section 8.21) are considered preferred, the parser cannot
resolve the ambiguity on that basis.
However, Step 3.f tells
it to assume that the unknown-type literal is of the same type as the other
input, that is, integer array. Now only one of the two operators can match,
so array inclusion is selected. (Had range inclusion been selected, we would
have gotten an error, because the string does not have the right format to be
a range literal.)
Example 10.5. Custom Operator on a Domain Type
Users sometimes try to declare operators applying just to a domain type. This is possible but is not nearly as useful as it might seem, because the operator resolution rules are designed to select operators applying to the domain's base type. As an example consider
CREATE DOMAIN mytext AS text CHECK(...);
CREATE FUNCTION mytext_eq_text (mytext, text) RETURNS boolean AS ...;
CREATE OPERATOR = (procedure=mytext_eq_text, leftarg=mytext, rightarg=text);
CREATE TABLE mytable (val mytext);

SELECT * FROM mytable WHERE val = 'foo';
This query will not use the custom operator. The parser will first see if
there is a mytext
=
mytext
operator
(Step 2.a), which there is not;
then it will consider the domain's base type text
, and see if
there is a text
=
text
operator
(Step 2.b), which there is;
so it resolves the unknown
-type literal as text
and
uses the text
=
text
operator.
The only way to get the custom operator to be used is to explicitly cast
the literal:
SELECT * FROM mytable WHERE val = text 'foo';
so that the mytext
=
text
operator is found
immediately according to the exact-match rule. If the best-match rules
are reached, they actively discriminate against operators on domain types.
If they did not, such an operator would create too many ambiguous-operator
failures, because the casting rules always consider a domain as castable
to or from its base type, and so the domain operator would be considered
usable in all the same cases as a similarly-named operator on the base type.
The specific function that is referenced by a function call is determined using the following procedure.
Function Type Resolution
Select the functions to be considered from the
pg_proc
system catalog. If a non-schema-qualified
function name was used, the functions
considered are those with the matching name and argument count that are
visible in the current search path (see Section 5.9.3).
If a qualified function name was given, only functions in the specified
schema are considered.
If the search path finds multiple functions of identical argument types, only the one appearing earliest in the path is considered. Functions of different argument types are considered on an equal footing regardless of search path position.
If a function is declared with a VARIADIC
array parameter, and
the call does not use the VARIADIC
keyword, then the function
is treated as if the array parameter were replaced by one or more occurrences
of its element type, as needed to match the call. After such expansion the
function might have effective argument types identical to some non-variadic
function. In that case the function appearing earlier in the search path is
used, or if the two functions are in the same schema, the non-variadic one is
preferred.
This creates a security hazard when calling, via qualified name
[10],
a variadic function found in a schema that permits untrusted users to create
objects. A malicious user can take control and execute arbitrary SQL
functions as though you executed them. Substitute a call bearing
the VARIADIC
keyword, which bypasses this hazard. Calls
populating VARIADIC "any"
parameters often have no
equivalent formulation containing the VARIADIC
keyword. To
issue those calls safely, the function's schema must permit only trusted users
to create objects.
Functions that have default values for parameters are considered to match any call that omits zero or more of the defaultable parameter positions. If more than one such function matches a call, the one appearing earliest in the search path is used. If there are two or more such functions in the same schema with identical parameter types in the non-defaulted positions (which is possible if they have different sets of defaultable parameters), the system will not be able to determine which to prefer, and so an “ambiguous function call” error will result if no better match to the call can be found.
This creates an availability hazard when calling, via qualified name[10], any function found in a schema that permits untrusted users to create objects. A malicious user can create a function with the name of an existing function, replicating that function's parameters and appending novel parameters having default values. This precludes new calls to the original function. To forestall this hazard, place functions in schemas that permit only trusted users to create objects.
Check for a function accepting exactly the input argument types.
If one exists (there can be only one exact match in the set of
functions considered), use it. Lack of an exact match creates a security
hazard when calling, via qualified
name[10], a function found in a
schema that permits untrusted users to create objects. In such situations,
cast arguments to force an exact match. (Cases involving unknown
will never find a match at this step.)
If no exact match is found, see if the function call appears
to be a special type conversion request. This happens if the function call
has just one argument and the function name is the same as the (internal)
name of some data type. Furthermore, the function argument must be either
an unknown-type literal, or a type that is binary-coercible to the named
data type, or a type that could be converted to the named data type by
applying that type's I/O functions (that is, the conversion is either to or
from one of the standard string types). When these conditions are met,
the function call is treated as a form of CAST
specification.
[11]
Look for the best match.
Discard candidate functions for which the input types do not match
and cannot be converted (using an implicit conversion) to match.
unknown
literals are
assumed to be convertible to anything for this purpose. If only one
candidate remains, use it; else continue to the next step.
If any input argument is of a domain type, treat it as being of the domain's base type for all subsequent steps. This ensures that domains act like their base types for purposes of ambiguous-function resolution.
Run through all candidates and keep those with the most exact matches on input types. Keep all candidates if none have exact matches. If only one candidate remains, use it; else continue to the next step.
Run through all candidates and keep those that accept preferred types (of the input data type's type category) at the most positions where type conversion will be required. Keep all candidates if none accept preferred types. If only one candidate remains, use it; else continue to the next step.
If any input arguments are unknown
, check the type categories
accepted
at those argument positions by the remaining candidates. At each position,
select the string
category if any candidate accepts that category.
(This bias towards string
is appropriate since an unknown-type literal looks like a string.)
Otherwise, if all the remaining candidates accept the same type category,
select that category; otherwise fail because
the correct choice cannot be deduced without more clues.
Now discard candidates that do not accept the selected type category.
Furthermore, if any candidate accepts a preferred type in that category,
discard candidates that accept non-preferred types for that argument.
Keep all candidates if none survive these tests.
If only one candidate remains, use it; else continue to the next step.
If there are both unknown
and known-type arguments, and all
the known-type arguments have the same type, assume that the
unknown
arguments are also of that type, and check which
candidates can accept that type at the unknown
-argument
positions. If exactly one candidate passes this test, use it.
Otherwise, fail.
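As a hedged illustration of the special type-conversion-request rule above (not one of the official examples): a call whose name matches a type name and whose single argument is an unknown-type literal is handled as a cast:

SELECT int8('42') AS function_style, CAST('42' AS int8) AS explicit_cast;
-- both expressions are resolved the same way; the first is treated as a
-- form of CAST specification rather than an ordinary function call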
Note that the “best match” rules are identical for operator and function type resolution. Some examples follow.
Example 10.6. Rounding Function Argument Type Resolution
There is only one round
function that takes two
arguments; it takes a first argument of type numeric
and
a second argument of type integer
.
So the following query automatically converts
the first argument of type integer
to
numeric
:
SELECT round(4, 4);

 round
--------
 4.0000
(1 row)
That query is actually transformed by the parser to:
SELECT round(CAST (4 AS numeric), 4);
Since numeric constants with decimal points are initially assigned the
type numeric
, the following query will require no type
conversion and therefore might be slightly more efficient:
SELECT round(4.0, 4);
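To see which round signatures the catalog actually contains, a query such as this can be used (a sketch, assuming ordinary access to pg_proc):

SELECT oid::regprocedure AS signature
FROM pg_proc
WHERE proname = 'round'
ORDER BY oid;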
Example 10.7. Variadic Function Resolution
CREATE FUNCTION public.variadic_example(VARIADIC numeric[]) RETURNS int
  LANGUAGE sql AS 'SELECT 1';
CREATE FUNCTION
This function accepts, but does not require, the VARIADIC keyword. It tolerates both integer and numeric arguments:
SELECT public.variadic_example(0),
       public.variadic_example(0.0),
       public.variadic_example(VARIADIC array[0.0]);

 variadic_example | variadic_example | variadic_example
------------------+------------------+------------------
                1 |                1 |                1
(1 row)
However, the first and second calls will prefer more-specific functions, if available:
CREATE FUNCTION public.variadic_example(numeric) RETURNS int
  LANGUAGE sql AS 'SELECT 2';
CREATE FUNCTION

CREATE FUNCTION public.variadic_example(int) RETURNS int
  LANGUAGE sql AS 'SELECT 3';
CREATE FUNCTION

SELECT public.variadic_example(0),
       public.variadic_example(0.0),
       public.variadic_example(VARIADIC array[0.0]);

 variadic_example | variadic_example | variadic_example
------------------+------------------+------------------
                3 |                2 |                1
(1 row)
Given the default configuration and only the first function existing, the
first and second calls are insecure. Any user could intercept them by
creating the second or third function. By matching the argument type exactly
and using the VARIADIC
keyword, the third call is secure.
Example 10.8. Substring Function Type Resolution
There are several substr
functions, one of which
takes types text
and integer
. If called
with a string constant of unspecified type, the system chooses the
candidate function that accepts an argument of the preferred category
string
(namely of type text
).
SELECT substr('1234', 3);

 substr
--------
 34
(1 row)
If the string is declared to be of type varchar
, as might be the case
if it comes from a table, then the parser will try to convert it to become text
:
SELECT substr(varchar '1234', 3);

 substr
--------
 34
(1 row)
This is transformed by the parser to effectively become:
SELECT substr(CAST (varchar '1234' AS text), 3);
The parser learns from the pg_cast
catalog that
text
and varchar
are binary-compatible, meaning that one can be passed to a function that
accepts the other without doing any physical conversion. Therefore, no
type conversion call is really inserted in this case.
And, if the function is called with an argument of type integer
,
the parser will try to convert that to text
:
SELECT substr(1234, 3);

ERROR:  function substr(integer, integer) does not exist
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
This does not work because integer
does not have an implicit cast
to text
. An explicit cast will work, however:
SELECT substr(CAST (1234 AS text), 3);

 substr
--------
 34
(1 row)
Values to be inserted into a table are converted to the destination column's data type according to the following steps.
Value Storage Type Conversion
Check for an exact match with the target.
Otherwise, try to convert the expression to the target type. This is possible
if an assignment cast between the two types is registered in the
pg_cast
catalog (see CREATE CAST).
Alternatively, if the expression is an unknown-type literal, the contents of
the literal string will be fed to the input conversion routine for the target
type.
Check to see if there is a sizing cast for the target type. A sizing
cast is a cast from that type to itself. If one is found in the
pg_cast
catalog, apply it to the expression before storing
into the destination column. The implementation function for such a cast
always takes an extra parameter of type integer
, which receives
the destination column's atttypmod
value (typically its
declared length, although the interpretation of atttypmod
varies for different data types), and it may take a third boolean
parameter that says whether the cast is explicit or implicit. The cast
function
is responsible for applying any length-dependent semantics such as size
checking or truncation.
Example 10.9. character
Storage Type Conversion
For a target column declared as character(20)
the following
statement shows that the stored value is sized correctly:
CREATE TABLE vv (v character(20));
INSERT INTO vv SELECT 'abc' || 'def';
SELECT v, octet_length(v) FROM vv;

          v           | octet_length
----------------------+--------------
 abcdef               |           20
(1 row)
What has really happened here is that the two unknown literals are resolved
to text
by default, allowing the ||
operator
to be resolved as text
concatenation. Then the text
result of the operator is converted to bpchar
(“blank-padded
char”, the internal name of the character
data type) to match the target
column type. (Since the conversion from text
to
bpchar
is binary-coercible, this conversion does
not insert any real function call.) Finally, the sizing function
bpchar(bpchar, integer, boolean)
is found in the system catalog
and applied to the operator's result and the stored column length. This
type-specific function performs the required length check and addition of
padding spaces.
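The same sizing cast is what rejects values that are too long. Continuing the vv example (a sketch; the exact error wording can vary between versions):

INSERT INTO vv VALUES ('a string considerably longer than twenty characters');
-- expected to fail with something like:
--   ERROR:  value too long for type character(20)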
UNION, CASE, and Related Constructs
SQL UNION
constructs must match up possibly dissimilar
types to become a single result set. The resolution algorithm is
applied separately to each output column of a union query. The
INTERSECT
and EXCEPT
constructs resolve
dissimilar types in the same way as UNION
.
Some other constructs, including
CASE
, ARRAY
, VALUES
,
and the GREATEST
and LEAST
functions, use the identical
algorithm to match up their component expressions and select a result
data type.
Type Resolution for UNION, CASE, and Related Constructs
If all inputs are of the same type, and it is not unknown
,
resolve as that type.
If any input is of a domain type, treat it as being of the domain's base type for all subsequent steps. [12]
If all inputs are of type unknown
, resolve as type
text
(the preferred type of the string category).
Otherwise, unknown
inputs are ignored for the purposes
of the remaining rules.
If the non-unknown inputs are not all of the same type category, fail.
Select the first non-unknown input type as the candidate type, then consider each other non-unknown input type, left to right. [13] If the candidate type can be implicitly converted to the other type, but not vice-versa, select the other type as the new candidate type. Then continue considering the remaining inputs. If, at any stage of this process, a preferred type is selected, stop considering additional inputs.
Convert all inputs to the final candidate type. Fail if there is not an implicit conversion from a given input type to the candidate type.
Some examples follow.
Example 10.10. Type Resolution with Underspecified Types in a Union
SELECT text 'a' AS "text" UNION SELECT 'b';

 text
------
 a
 b
(2 rows)
Here, the unknown-type literal 'b'
will be resolved to type text
.
Example 10.11. Type Resolution in a Simple Union
SELECT 1.2 AS "numeric" UNION SELECT 1;

 numeric
---------
       1
     1.2
(2 rows)
The literal 1.2
is of type numeric
,
and the integer
value 1
can be cast implicitly to
numeric
, so that type is used.
Example 10.12. Type Resolution in a Transposed Union
SELECT 1 AS "real" UNION SELECT CAST('2.2' AS REAL);

 real
------
    1
  2.2
(2 rows)
Here, since type real
cannot be implicitly cast to integer
,
but integer
can be implicitly cast to real
, the union
result type is resolved as real
.
Example 10.13. Type Resolution in a Nested Union
SELECT NULL UNION SELECT NULL UNION SELECT 1;

ERROR:  UNION types text and integer cannot be matched
This failure occurs because PostgreSQL treats
multiple UNION
s as a nest of pairwise operations;
that is, this input is the same as
(SELECT NULL UNION SELECT NULL) UNION SELECT 1;
The inner UNION
is resolved as emitting
type text
, according to the rules given above. Then the
outer UNION
has inputs of types text
and integer
, leading to the observed error. The problem
can be fixed by ensuring that the leftmost UNION
has at least one input of the desired result type.
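For example (a sketch): explicitly typing a literal in the leftmost arm, or moving the typed arm to the front, makes the whole nest resolve as integer:

SELECT NULL::integer UNION SELECT NULL UNION SELECT 1;

SELECT 1 UNION SELECT NULL UNION SELECT NULL;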
INTERSECT
and EXCEPT
operations are
likewise resolved pairwise. However, the other constructs described in this
section consider all of their inputs in one resolution step.
SELECT Output Columns
The rules given in the preceding sections will result in assignment
of non-unknown
data types to all expressions in an SQL query,
except for unspecified-type literals that appear as simple output
columns of a SELECT
command. For example, in
SELECT 'Hello World';
there is nothing to identify what type the string literal should be
taken as. In this situation PostgreSQL will fall back
to resolving the literal's type as text
.
When the SELECT
is one arm of a UNION
(or INTERSECT
or EXCEPT
) construct, or when it
appears within INSERT ... SELECT
, this rule is not applied
since rules given in preceding sections take precedence. The type of an
unspecified-type literal can be taken from the other UNION
arm
in the first case, or from the destination column in the second case.
RETURNING
lists are treated the same as SELECT
output lists for this purpose.
Prior to PostgreSQL 10, this rule did not exist, and
unspecified-type literals in a SELECT
output list were
left as type unknown
. That had assorted bad consequences,
so it's been changed.
[9] The hazard does not arise with a non-schema-qualified name, because a search path containing schemas that permit untrusted users to create objects is not a secure schema usage pattern.
[10] The hazard does not arise with a non-schema-qualified name, because a search path containing schemas that permit untrusted users to create objects is not a secure schema usage pattern.
[11] The reason for this step is to support function-style cast specifications in cases where there is not an actual cast function. If there is a cast function, it is conventionally named after its output type, and so there is no need to have a special case. See CREATE CAST for additional commentary.
[12]
Somewhat like the treatment of domain inputs for operators and
functions, this behavior allows a domain type to be preserved through
a UNION
or similar construct, so long as the user is
careful to ensure that all inputs are implicitly or explicitly of that
exact type. Otherwise the domain's base type will be used.
[13]
For historical reasons, CASE
treats
its ELSE
clause (if any) as the “first”
input, with the THEN
clauses(s) considered after
that. In all other cases, “left to right” means the order
in which the expressions appear in the query text.
Indexes are a common way to enhance database performance. An index allows the database server to find and retrieve specific rows much faster than it could do without an index. But indexes also add overhead to the database system as a whole, so they should be used sensibly.
Suppose we have a table similar to this:
CREATE TABLE test1 (
    id integer,
    content varchar
);
and the application issues many queries of the form:
SELECT content FROM test1 WHERE id = constant;
With no advance preparation, the system would have to scan the entire
test1
table, row by row, to find all
matching entries. If there are many rows in
test1
and only a few rows (perhaps zero
or one) that would be returned by such a query, this is clearly an
inefficient method. But if the system has been instructed to maintain an
index on the id
column, it can use a more
efficient method for locating matching rows. For instance, it
might only have to walk a few levels deep into a search tree.
A similar approach is used in most non-fiction books: terms and concepts that are frequently looked up by readers are collected in an alphabetic index at the end of the book. The interested reader can scan the index relatively quickly and flip to the appropriate page(s), rather than having to read the entire book to find the material of interest. Just as it is the task of the author to anticipate the items that readers are likely to look up, it is the task of the database programmer to foresee which indexes will be useful.
The following command can be used to create an index on the
id
column, as discussed:
CREATE INDEX test1_id_index ON test1 (id);
The name test1_id_index
can be chosen
freely, but you should pick something that enables you to remember
later what the index was for.
To remove an index, use the DROP INDEX
command.
Indexes can be added to and removed from tables at any time.
Once an index is created, no further intervention is required: the
system will update the index when the table is modified, and it will
use the index in queries when it thinks doing so would be more efficient
than a sequential table scan. But you might have to run the
ANALYZE
command regularly to update
statistics to allow the query planner to make educated decisions.
See Chapter 14 for information about
how to find out whether an index is used and when and why the
planner might choose not to use an index.
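As a quick sanity check (a sketch; the chosen plan depends on the amount of data and on current statistics), EXPLAIN shows whether the planner actually uses the index:

EXPLAIN SELECT content FROM test1 WHERE id = 42;
-- with enough rows and fresh statistics, the plan should mention
-- test1_id_index (an Index Scan or Bitmap Index Scan) instead of a Seq Scan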
Indexes can also benefit UPDATE
and
DELETE
commands with search conditions.
Indexes can moreover be used in join searches. Thus,
an index defined on a column that is part of a join condition can
also significantly speed up queries with joins.
In general, PostgreSQL indexes can be used
to optimize queries that contain one or more WHERE
or JOIN
clauses of the form
indexed-column indexable-operator comparison-value
Here, the indexed-column
is whatever
column or expression the index has been defined on.
The indexable-operator
is an operator that
is a member of the index's operator class for
the indexed column. (More details about that appear below.)
And the comparison-value
can be any
expression that is not volatile and does not reference the index's
table.
In some cases the query planner can extract an indexable clause of this form from another SQL construct. A simple example is that if the original clause was
comparison-value operator indexed-column
then it can be flipped around into indexable form if the
original operator
has a commutator
operator that is a member of the index's operator class.
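For instance (a sketch): because = is its own commutator, a clause written with the constant on the left can still be matched to the index on id:

SELECT content FROM test1 WHERE 42 = id;
-- the planner can flip this to id = 42, which fits the indexable form above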
Creating an index on a large table can take a long time. By default,
PostgreSQL allows reads (SELECT
statements) to occur
on the table in parallel with index creation, but writes (INSERT
,
UPDATE
, DELETE
) are blocked until the index build is finished.
In production environments this is often unacceptable.
It is possible to allow writes to occur in parallel with index
creation, but there are several caveats to be aware of —
for more information see Building Indexes Concurrently.
After an index is created, the system has to keep it synchronized with the table. This adds overhead to data manipulation operations. Indexes can also prevent the creation of heap-only tuples. Therefore indexes that are seldom or never used in queries should be removed.
PostgreSQL provides several index types:
B-tree, Hash, GiST, SP-GiST, GIN, BRIN, and the extension bloom.
Each index type uses a different
algorithm that is best suited to different types of indexable clauses.
By default, the CREATE
INDEX
command creates
B-tree indexes, which fit the most common situations.
The other index types are selected by writing the keyword
USING
followed by the index type name.
For example, to create a Hash index:
CREATE INDEX name ON table USING HASH (column);
B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison using one of these operators:
< <= = >= >
Constructs equivalent to combinations of these operators, such as
BETWEEN
and IN
, can also be implemented with
a B-tree index search. Also, an IS NULL
or IS NOT
NULL
condition on an index column can be used with a B-tree index.
The optimizer can also use a B-tree index for queries involving the
pattern matching operators LIKE
and ~
if the pattern is a constant and is anchored to
the beginning of the string — for example, col LIKE
'foo%'
or col ~ '^foo'
, but not
col LIKE '%bar'
. However, if your database does not
use the C locale you will need to create the index with a special
operator class to support indexing of pattern-matching queries; see
Section 11.10 below. It is also possible to use
B-tree indexes for ILIKE
and
~*
, but only if the pattern starts with
non-alphabetic characters, i.e., characters that are not affected by
upper/lower case conversion.
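For example, in a non-C locale an index intended for anchored pattern matches on test1.content could be declared with the pattern operator class (a sketch; the index name is illustrative):

CREATE INDEX test1_content_pattern_idx ON test1 (content varchar_pattern_ops);
-- supports clauses such as: content LIKE 'foo%'  or  content ~ '^foo'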
B-tree indexes can also be used to retrieve data in sorted order. This is not always faster than a simple scan and sort, but it is often helpful.
Hash indexes store a 32-bit hash code derived from the value of the indexed column. Hence, such indexes can only handle simple equality comparisons. The query planner will consider using a hash index whenever an indexed column is involved in a comparison using the equal operator:
=
GiST indexes are not a single kind of index, but rather an infrastructure within which many different indexing strategies can be implemented. Accordingly, the particular operators with which a GiST index can be used vary depending on the indexing strategy (the operator class). As an example, the standard distribution of PostgreSQL includes GiST operator classes for several two-dimensional geometric data types, which support indexed queries using these operators:
<< &< &> >> <<| &<| |&> |>> @> <@ ~= &&
(See Section 9.11 for the meaning of
these operators.)
The GiST operator classes included in the standard distribution are
documented in Table 65.1.
Many other GiST operator
classes are available in the contrib
collection or as separate
projects. For more information see Chapter 65.
GiST indexes are also capable of optimizing “nearest-neighbor” searches, such as
SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
which finds the ten places closest to a given target point. The ability to do this is again dependent on the particular operator class being used. In Table 65.1, operators that can be used in this way are listed in the column “Ordering Operators”.
SP-GiST indexes, like GiST indexes, offer an infrastructure that supports various kinds of searches. SP-GiST permits implementation of a wide range of different non-balanced disk-based data structures, such as quadtrees, k-d trees, and radix trees (tries). As an example, the standard distribution of PostgreSQL includes SP-GiST operator classes for two-dimensional points, which support indexed queries using these operators:
<< >> ~= <@ <<| |>>
(See Section 9.11 for the meaning of these operators.) The SP-GiST operator classes included in the standard distribution are documented in Table 66.1. For more information see Chapter 66.
Like GiST, SP-GiST supports “nearest-neighbor” searches. For SP-GiST operator classes that support distance ordering, the corresponding operator is listed in the “Ordering Operators” column in Table 66.1.
GIN indexes are “inverted indexes” which are appropriate for data values that contain multiple component values, such as arrays. An inverted index contains a separate entry for each component value, and can efficiently handle queries that test for the presence of specific component values.
Like GiST and SP-GiST, GIN can support many different user-defined indexing strategies, and the particular operators with which a GIN index can be used vary depending on the indexing strategy. As an example, the standard distribution of PostgreSQL includes a GIN operator class for arrays, which supports indexed queries using these operators:
<@ @> = &&
(See Section 9.19 for the meaning of
these operators.)
The GIN operator classes included in the standard distribution are
documented in Table 67.1.
Many other GIN operator
classes are available in the contrib
collection or as separate
projects. For more information see Chapter 67.
BRIN indexes (a shorthand for Block Range INdexes) store summaries about the values stored in consecutive physical block ranges of a table. Thus, they are most effective for columns whose values are well-correlated with the physical order of the table rows. Like GiST, SP-GiST and GIN, BRIN can support many different indexing strategies, and the particular operators with which a BRIN index can be used vary depending on the indexing strategy. For data types that have a linear sort order, the indexed data corresponds to the minimum and maximum values of the values in the column for each block range. This supports indexed queries using these operators:
< <= = >= >
The BRIN operator classes included in the standard distribution are documented in Table 68.1. For more information see Chapter 68.
An index can be defined on more than one column of a table. For example, if you have a table of this form:
CREATE TABLE test2 (
    major int,
    minor int,
    name varchar
);
(say, you keep your /dev
directory in a database...) and you frequently issue queries like:
SELECT name FROM test2 WHERE major = constant AND minor = constant;
then it might be appropriate to define an index on the columns
major
and
minor
together, e.g.:
CREATE INDEX test2_mm_idx ON test2 (major, minor);
Currently, only the B-tree, GiST, GIN, and BRIN index types support
multiple-key-column indexes. Whether there can be multiple key
columns is independent of whether INCLUDE
columns
can be added to the index. Indexes can have up to 32 columns,
including INCLUDE
columns. (This limit can be
altered when building PostgreSQL; see the
file pg_config_manual.h
.)
A multicolumn B-tree index can be used with query conditions that
involve any subset of the index's columns, but the index is most
efficient when there are constraints on the leading (leftmost) columns.
The exact rule is that equality constraints on leading columns, plus
any inequality constraints on the first column that does not have an
equality constraint, will be used to limit the portion of the index
that is scanned. Constraints on columns to the right of these columns
are checked in the index, so they save visits to the table proper, but
they do not reduce the portion of the index that has to be scanned.
For example, given an index on (a, b, c)
and a
query condition WHERE a = 5 AND b >= 42 AND c < 77
,
the index would have to be scanned from the first entry with
a
= 5 and b
= 42 up through the last entry with
a
= 5. Index entries with c
>= 77 would be
skipped, but they'd still have to be scanned through.
This index could in principle be used for queries that have constraints
on b
and/or c
with no constraint on a
— but the entire index would have to be scanned, so in most cases
the planner would prefer a sequential table scan over using the index.
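A minimal sketch of the (a, b, c) case discussed above; the table and index names here are illustrative:

CREATE TABLE t3 (a int, b int, c int);
CREATE INDEX t3_abc_idx ON t3 (a, b, c);

-- a = 5 and b >= 42 bound the portion of the index that is scanned;
-- c < 77 is checked against index entries but does not narrow that portion
SELECT * FROM t3 WHERE a = 5 AND b >= 42 AND c < 77;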
A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions on additional columns restrict the entries returned by the index, but the condition on the first column is the most important one for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.
A multicolumn GIN index can be used with query conditions that involve any subset of the index's columns. Unlike B-tree or GiST, index search effectiveness is the same regardless of which index column(s) the query conditions use.
A multicolumn BRIN index can be used with query conditions that
involve any subset of the index's columns. Like GIN and unlike B-tree or
GiST, index search effectiveness is the same regardless of which index
column(s) the query conditions use. The only reason to have multiple BRIN
indexes instead of one multicolumn BRIN index on a single table is to have
a different pages_per_range
storage parameter.
Of course, each column must be used with operators appropriate to the index type; clauses that involve other operators will not be considered.
Multicolumn indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time. Indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized. See also Section 11.5 and Section 11.9 for some discussion of the merits of different index configurations.
Indexes and ORDER BY
In addition to simply finding the rows to be returned by a query,
an index may be able to deliver them in a specific sorted order.
This allows a query's ORDER BY
specification to be honored
without a separate sorting step. Of the index types currently
supported by PostgreSQL, only B-tree
can produce sorted output — the other index types return
matching rows in an unspecified, implementation-dependent order.
The planner will consider satisfying an ORDER BY
specification
either by scanning an available index that matches the specification,
or by scanning the table in physical order and doing an explicit
sort. For a query that requires scanning a large fraction of the
table, an explicit sort is likely to be faster than using an index
because it requires
less disk I/O due to following a sequential access pattern. Indexes are
more useful when only a few rows need be fetched. An important
special case is ORDER BY
in combination with
LIMIT
n
: an explicit sort will have to process
all the data to identify the first n
rows, but if there is
an index matching the ORDER BY
, the first n
rows can be retrieved directly, without scanning the remainder at all.
By default, B-tree indexes store their entries in ascending order
with nulls last (table TID is treated as a tiebreaker column among
otherwise equal entries). This means that a forward scan of an
index on column x
produces output satisfying ORDER BY x
(or more verbosely, ORDER BY x ASC NULLS LAST
). The
index can also be scanned backward, producing output satisfying
ORDER BY x DESC
(or more verbosely, ORDER BY x DESC NULLS FIRST
, since
NULLS FIRST
is the default for ORDER BY DESC
).
You can adjust the ordering of a B-tree index by including the
options ASC
, DESC
, NULLS FIRST
,
and/or NULLS LAST
when creating the index; for example:
CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST);
CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
An index stored in ascending order with nulls first can satisfy
either ORDER BY x ASC NULLS FIRST
or
ORDER BY x DESC NULLS LAST
depending on which direction
it is scanned in.
You might wonder why bother providing all four options, when two
options together with the possibility of backward scan would cover
all the variants of ORDER BY
. In single-column indexes
the options are indeed redundant, but in multicolumn indexes they can be
useful. Consider a two-column index on (x, y)
: this can
satisfy ORDER BY x, y
if we scan forward, or
ORDER BY x DESC, y DESC
if we scan backward.
But it might be that the application frequently needs to use
ORDER BY x ASC, y DESC
. There is no way to get that
ordering from a plain index, but it is possible if the index is defined
as (x ASC, y DESC)
or (x DESC, y ASC)
.
Obviously, indexes with non-default sort orderings are a fairly specialized feature, but sometimes they can produce tremendous speedups for certain queries. Whether it's worth maintaining such an index depends on how often you use queries that require a special sort ordering.
A single index scan can only use query clauses that use the index's
columns with operators of its operator class and are joined with
AND
. For example, given an index on (a, b)
a query condition like WHERE a = 5 AND b = 6
could
use the index, but a query like WHERE a = 5 OR b = 6
could not
directly use the index.
Fortunately,
PostgreSQL has the ability to combine multiple indexes
(including multiple uses of the same index) to handle cases that cannot
be implemented by single index scans. The system can form AND
and OR
conditions across several index scans. For example,
a query like WHERE x = 42 OR x = 47 OR x = 53 OR x = 99
could be broken down into four separate scans of an index on x
,
each scan using one of the query clauses. The results of these scans are
then ORed together to produce the result. Another example is that if we
have separate indexes on x
and y
, one possible
implementation of a query like WHERE x = 5 AND y = 6
is to
use each index with the appropriate query clause and then AND together
the index results to identify the result rows.
To combine multiple indexes, the system scans each needed index and
prepares a bitmap in memory giving the locations of
table rows that are reported as matching that index's conditions.
The bitmaps are then ANDed and ORed together as needed by the query.
Finally, the actual table rows are visited and returned. The table rows
are visited in physical order, because that is how the bitmap is laid
out; this means that any ordering of the original indexes is lost, and
so a separate sort step will be needed if the query has an ORDER
BY
clause. For this reason, and because each additional index scan
adds extra time, the planner will sometimes choose to use a simple index
scan even though additional indexes are available that could have been
used as well.
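A sketch of such a combination (hypothetical table and indexes; whether the planner actually combines them depends on the data and on planner settings):

CREATE TABLE points (x int, y int);
CREATE INDEX points_x_idx ON points (x);
CREATE INDEX points_y_idx ON points (y);

EXPLAIN SELECT * FROM points WHERE x = 5 AND y = 6;
-- with suitable data, the plan may show a BitmapAnd of two Bitmap Index Scans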
In all but the simplest applications, there are various combinations of
indexes that might be useful, and the database developer must make
trade-offs to decide which indexes to provide. Sometimes multicolumn
indexes are best, but sometimes it's better to create separate indexes
and rely on the index-combination feature. For example, if your
workload includes a mix of queries that sometimes involve only column
x
, sometimes only column y
, and sometimes both
columns, you might choose to create two separate indexes on
x
and y
, relying on index combination to
process the queries that use both columns. You could also create a
multicolumn index on (x, y)
. This index would typically be
more efficient than index combination for queries involving both
columns, but as discussed in Section 11.3, it
would be almost useless for queries involving only y
, so it
should not be the only index. A combination of the multicolumn index
and a separate index on y
would serve reasonably well. For
queries involving only x
, the multicolumn index could be
used, though it would be larger and hence slower than an index on
x
alone. The last alternative is to create all three
indexes, but this is probably only reasonable if the table is searched
much more often than it is updated and all three types of query are
common. If one of the types of query is much less common than the
others, you'd probably settle for creating just the two indexes that
best match the common types.
Indexes can also be used to enforce uniqueness of a column's value, or the uniqueness of the combined values of more than one column.
CREATE UNIQUE INDEX name ON table (column [, ...]);
Currently, only B-tree indexes can be declared unique.
When an index is declared unique, multiple table rows with equal indexed values are not allowed. Null values are not considered equal. A multicolumn unique index will only reject cases where all indexed columns are equal in multiple rows.
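For example (a sketch reusing the test2 table from earlier; the index name is illustrative):

CREATE UNIQUE INDEX test2_mm_unique ON test2 (major, minor);
-- a second row with the same (major, minor) pair is now rejected, but rows
-- where major or minor is NULL never conflict, since null values are not
-- considered equal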
PostgreSQL automatically creates a unique index when a unique constraint or primary key is defined for a table. The index covers the columns that make up the primary key or unique constraint (a multicolumn index, if appropriate), and is the mechanism that enforces the constraint.
There's no need to manually create indexes on unique columns; doing so would just duplicate the automatically-created index.
An index column need not be just a column of the underlying table, but can be a function or scalar expression computed from one or more columns of the table. This feature is useful to obtain fast access to tables based on the results of computations.
For example, a common way to do case-insensitive comparisons is to
use the lower
function:
SELECT * FROM test1 WHERE lower(col1) = 'value';
This query can use an index if one has been
defined on the result of the lower(col1)
function:
CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
If we were to declare this index UNIQUE
, it would prevent
creation of rows whose col1
values differ only in case,
as well as rows whose col1
values are actually identical.
Thus, indexes on expressions can be used to enforce constraints that
are not definable as simple unique constraints.
As another example, if one often does queries like:
SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';
then it might be worth creating an index like this:
CREATE INDEX people_names ON people ((first_name || ' ' || last_name));
The syntax of the CREATE INDEX
command normally requires
writing parentheses around index expressions, as shown in the second
example. The parentheses can be omitted when the expression is just
a function call, as in the first example.
Index expressions are relatively expensive to maintain, because the
derived expression(s) must be computed for each row insertion
and non-HOT update. However, the index expressions are
not recomputed during an indexed search, since they are
already stored in the index. In both examples above, the system
sees the query as just WHERE indexedcolumn = 'constant'
and so the speed of the search is equivalent to any other simple index
query. Thus, indexes on expressions are useful when retrieval speed
is more important than insertion and update speed.
A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate. Partial indexes are a specialized feature, but there are several situations in which they are useful.
One major reason for using a partial index is to avoid indexing common values. Since a query searching for a common value (one that accounts for more than a few percent of all the table rows) will not use the index anyway, there is no point in keeping those rows in the index at all. This reduces the size of the index, which will speed up those queries that do use the index. It will also speed up many table update operations because the index does not need to be updated in all cases. Example 11.1 shows a possible application of this idea.
Example 11.1. Setting up a Partial Index to Exclude Common Values
Suppose you are storing web server access logs in a database. Most accesses originate from the IP address range of your organization but some are from elsewhere (say, employees on dial-up connections). If your searches by IP are primarily for outside accesses, you probably do not need to index the IP range that corresponds to your organization's subnet.
Assume a table like this:
CREATE TABLE access_log (
    url varchar,
    client_ip inet,
    ...
);
To create a partial index that suits our example, use a command such as this:
CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255');
A typical query that can use this index would be:
SELECT * FROM access_log WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';
Here the query's IP address is covered by the partial index. The following query cannot use the partial index, as it uses an IP address that is excluded from the index:
SELECT * FROM access_log WHERE url = '/index.html' AND client_ip = inet '192.168.100.23';
Observe that this kind of partial index requires that the common values be predetermined, so such partial indexes are best used for data distributions that do not change. Such indexes can be recreated occasionally to adjust for new data distributions, but this adds maintenance effort.
Another possible use for a partial index is to exclude values from the index that the typical query workload is not interested in; this is shown in Example 11.2. This results in the same advantages as listed above, but it prevents the “uninteresting” values from being accessed via that index, even if an index scan might be profitable in that case. Obviously, setting up partial indexes for this kind of scenario will require a lot of care and experimentation.
Example 11.2. Setting up a Partial Index to Exclude Uninteresting Values
If you have a table that contains both billed and unbilled orders, where the unbilled orders take up a small fraction of the total table and yet those are the most-accessed rows, you can improve performance by creating an index on just the unbilled rows. The command to create the index would look like this:
CREATE INDEX orders_unbilled_index ON orders (order_nr) WHERE billed is not true;
A possible query to use this index would be:
SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;
However, the index can also be used in queries that do not involve
order_nr
at all, e.g.:
SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
This is not as efficient as a partial index on the
amount
column would be, since the system has to
scan the entire index. Yet, if there are relatively few unbilled
orders, using this partial index just to find the unbilled orders
could be a win.
Note that this query cannot use this index:
SELECT * FROM orders WHERE order_nr = 3501;
The order 3501 might be among the billed or unbilled orders.
Example 11.2 also illustrates that the
indexed column and the column used in the predicate do not need to
match. PostgreSQL supports partial
indexes with arbitrary predicates, so long as only columns of the
table being indexed are involved. However, keep in mind that the
predicate must match the conditions used in the queries that
are supposed to benefit from the index. To be precise, a partial
index can be used in a query only if the system can recognize that
the WHERE
condition of the query mathematically implies
the predicate of the index.
PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example “x < 1” implies “x < 2”; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index. For example a prepared query with a parameter might specify “x < ?”, which will never imply “x < 2” for all possible values of the parameter.
A third possible use for partial indexes does not require the index to be used in queries at all. The idea here is to create a unique index over a subset of a table, as in Example 11.3. This enforces uniqueness among the rows that satisfy the index predicate, without constraining those that do not.
Example 11.3. Setting up a Partial Unique Index
Suppose that we have a table describing test outcomes. We wish to ensure that there is only one “successful” entry for a given subject and target combination, but there might be any number of “unsuccessful” entries. Here is one way to do it:
CREATE TABLE tests (
    subject text,
    target text,
    success boolean,
    ...
);

CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
    WHERE success;
This is a particularly efficient approach when there are few successful tests and many unsuccessful ones. It is also possible to allow only one null in a column by creating a unique partial index with an IS NULL restriction.
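As a minimal sketch of that one-null idea (the table and column names here are illustrative, not taken from the examples above): indexing a constant-valued expression over only the rows where the column is null means the unique index can admit at most one such row.
CREATE TABLE profiles (user_id integer, nickname text);

-- At most one row may have a NULL nickname: every index entry for such rows
-- has the same indexed value, so a second NULL-nickname row violates uniqueness.
CREATE UNIQUE INDEX profiles_one_null_nickname
    ON profiles ((nickname IS NULL))
    WHERE nickname IS NULL;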
Finally, a partial index can also be used to override the system's query plan choices. Also, data sets with peculiar distributions might cause the system to use an index when it really should not. In that case the index can be set up so that it is not available for the offending query. Normally, PostgreSQL makes reasonable choices about index usage (e.g., it avoids them when retrieving common values, so the earlier example really only saves index size, it is not required to avoid index usage), and grossly incorrect plan choices are cause for a bug report.
Keep in mind that setting up a partial index indicates that you know at least as much as the query planner knows, in particular you know when an index might be profitable. Forming this knowledge requires experience and understanding of how indexes in PostgreSQL work. In most cases, the advantage of a partial index over a regular index will be minimal. There are cases where they are quite counterproductive, as in Example 11.4.
Example 11.4. Do Not Use Partial Indexes as a Substitute for Partitioning
You might be tempted to create a large set of non-overlapping partial indexes, for example
CREATE INDEX mytable_cat_1 ON mytable (data) WHERE category = 1;
CREATE INDEX mytable_cat_2 ON mytable (data) WHERE category = 2;
CREATE INDEX mytable_cat_3 ON mytable (data) WHERE category = 3;
...
CREATE INDEX mytable_cat_N ON mytable (data) WHERE category = N;
This is a bad idea! Almost certainly, you'll be better off with a single non-partial index, declared like
CREATE INDEX mytable_cat_data ON mytable (category, data);
(Put the category column first, for the reasons described in Section 11.3.) While a search in this larger index might have to descend through a couple more tree levels than a search in a smaller index, that's almost certainly going to be cheaper than the planner effort needed to select the appropriate one of the partial indexes. The core of the problem is that the system does not understand the relationship among the partial indexes, and will laboriously test each one to see if it's applicable to the current query.
If your table is large enough that a single index really is a bad idea, you should look into using partitioning instead (see Section 5.11). With that mechanism, the system does understand that the tables and indexes are non-overlapping, so far better performance is possible.
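A hedged sketch of that alternative, reusing the illustrative mytable/category/data names from above: with a declaratively partitioned table, a single index definition cascades to every partition, and the planner prunes partitions instead of weighing dozens of partial indexes.
CREATE TABLE mytable (
    category integer,
    data     text
) PARTITION BY LIST (category);

CREATE TABLE mytable_cat_1 PARTITION OF mytable FOR VALUES IN (1);
CREATE TABLE mytable_cat_2 PARTITION OF mytable FOR VALUES IN (2);
-- ... one partition per category of interest ...

-- A single partitioned index; matching indexes are created on each partition.
CREATE INDEX mytable_data_idx ON mytable (data);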
More information about partial indexes can be found in [ston89b], [olson93], and [seshadri95].
All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table's main data area (which is called the table's heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap. Furthermore, while the index entries that match a given indexable WHERE condition are usually close together in the index, the table rows they reference might be anywhere in the heap. The heap-access portion of an index scan thus involves a lot of random access into the heap, which can be slow, particularly on traditional rotating media. (As described in Section 11.5, bitmap scans try to alleviate this cost by doing the heap accesses in sorted order, but that only goes so far.)
To solve this performance problem, PostgreSQL supports index-only scans, which can answer queries from an index alone without any heap access. The basic idea is to return values directly out of each index entry instead of consulting the associated heap entry. There are two fundamental restrictions on when this method can be used:
The index type must support index-only scans. B-tree indexes always do. GiST and SP-GiST indexes support index-only scans for some operator classes but not others. Other index types have no support. The underlying requirement is that the index must physically store, or else be able to reconstruct, the original data value for each index entry. As a counterexample, GIN indexes cannot support index-only scans because each index entry typically holds only part of the original data value.
The query must reference only columns stored in the index. For example, given an index on columns x and y of a table that also has a column z, these queries could use index-only scans:
SELECT x, y FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND y < 42;
but these queries could not:
SELECT x, z FROM tab WHERE x = 'key';
SELECT x FROM tab WHERE x = 'key' AND z < 42;
(Expression indexes and partial indexes complicate this rule, as discussed below.)
If these two fundamental requirements are met, then all the data values required by the query are available from the index, so an index-only scan is physically possible. But there is an additional requirement for any table scan in PostgreSQL: it must verify that each retrieved row be “visible” to the query's MVCC snapshot, as discussed in Chapter 13. Visibility information is not stored in index entries, only in heap entries; so at first glance it would seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently. However, for seldom-changing data there is a way around this problem. PostgreSQL tracks, for each page in a table's heap, whether all rows stored in that page are old enough to be visible to all current and future transactions. This information is stored in a bit in the table's visibility map. An index-only scan, after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If it's set, the row is known visible and so the data can be returned with no further work. If it's not set, the heap entry must be visited to find out whether it's visible, so no performance advantage is gained over a standard index scan. Even in the successful case, this approach trades visibility map accesses for heap accesses; but since the visibility map is four orders of magnitude smaller than the heap it describes, far less physical I/O is needed to access it. In most situations the visibility map remains cached in memory all the time.
In short, while an index-only scan is possible given the two fundamental requirements, it will be a win only if a significant fraction of the table's heap pages have their all-visible map bits set. But tables in which a large fraction of the rows are unchanging are common enough to make this type of scan very useful in practice.
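A minimal way to see this in practice, assuming the tab(x, y) example above: VACUUM sets the all-visible bits in the visibility map, and EXPLAIN (ANALYZE) reports whether an Index Only Scan was chosen and how many heap fetches it still had to make.
-- Set visibility-map bits for pages whose rows are visible to all transactions.
VACUUM tab;

-- Look for "Index Only Scan" and a low "Heap Fetches" count in the output.
EXPLAIN (ANALYZE) SELECT x, y FROM tab WHERE x = 'key';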
To make effective use of the index-only scan feature, you might choose to create a covering index, which is an index specifically designed to include the columns needed by a particular type of query that you run frequently. Since queries typically need to retrieve more columns than just the ones they search on, PostgreSQL allows you to create an index in which some columns are just “payload” and are not part of the search key. This is done by adding an INCLUDE clause listing the extra columns. For example, if you commonly run queries like
SELECT y FROM tab WHERE x = 'key';
the traditional approach to speeding up such queries would be to create an index on x only. However, an index defined as
CREATE INDEX tab_x_y ON tab(x) INCLUDE (y);
could handle these queries as index-only scans, because y can be obtained from the index without visiting the heap.
Because column y is not part of the index's search key, it does not have to be of a data type that the index can handle; it's merely stored in the index and is not interpreted by the index machinery. Also, if the index is a unique index, that is
CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y);
the uniqueness condition applies to just column x, not to the combination of x and y. (An INCLUDE clause can also be written in UNIQUE and PRIMARY KEY constraints, providing alternative syntax for setting up an index like this.)
It's wise to be conservative about adding non-key payload columns to an index, especially wide columns. If an index tuple exceeds the maximum size allowed for the index type, data insertion will fail. In any case, non-key columns duplicate data from the index's table and bloat the size of the index, thus potentially slowing searches. And remember that there is little point in including payload columns in an index unless the table changes slowly enough that an index-only scan is likely to not need to access the heap. If the heap tuple must be visited anyway, it costs nothing more to get the column's value from there. Other restrictions are that expressions are not currently supported as included columns, and that only B-tree, GiST and SP-GiST indexes currently support included columns.
Before PostgreSQL had the INCLUDE feature, people sometimes made covering indexes by writing the payload columns as ordinary index columns, that is writing
CREATE INDEX tab_x_y ON tab(x, y);
even though they had no intention of ever using y as part of a WHERE clause. This works fine as long as the extra columns are trailing columns; making them be leading columns is unwise for the reasons explained in Section 11.3. However, this method doesn't support the case where you want the index to enforce uniqueness on the key column(s).
Suffix truncation always removes non-key columns from upper B-Tree levels. As payload columns, they are never used to guide index scans. The truncation process also removes one or more trailing key column(s) when the remaining prefix of key column(s) happens to be sufficient to describe tuples on the lowest B-Tree level. In practice, covering indexes without an INCLUDE clause often avoid storing columns that are effectively payload in the upper levels. However, explicitly defining payload columns as non-key columns reliably keeps the tuples in upper levels small.
In principle, index-only scans can be used with expression indexes. For example, given an index on f(x) where x is a table column, it should be possible to execute
SELECT f(x) FROM tab WHERE f(x) < 1;
as an index-only scan; and this is very attractive if f() is an expensive-to-compute function. However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index. In this example, x is not needed except in the context f(x), but the planner does not notice that and concludes that an index-only scan is not possible. If an index-only scan seems sufficiently worthwhile, this can be worked around by adding x as an included column, for example
CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x);
An additional caveat, if the goal is to avoid recalculating f(x), is that the planner won't necessarily match uses of f(x) that aren't in indexable WHERE clauses to the index column. It will usually get this right in simple queries such as shown above, but not in queries that involve joins. These deficiencies may be remedied in future versions of PostgreSQL.
Partial indexes also have interesting interactions with index-only scans. Consider the partial index shown in Example 11.3:
CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) WHERE success;
In principle, we could do an index-only scan on this index to satisfy a query like
SELECT target FROM tests WHERE subject = 'some-subject' AND success;
But there's a problem: the WHERE clause refers to success, which is not available as a result column of the index. Nonetheless, an index-only scan is possible because the plan does not need to recheck that part of the WHERE clause at run time: all entries found in the index necessarily have success = true, so this need not be explicitly checked in the plan. PostgreSQL versions 9.6 and later will recognize such cases and allow index-only scans to be generated, but older versions will not.
An index definition can specify an operator class for each column of an index.
CREATE INDEX name ON table (column opclass [ ( opclass_options ) ] [ sort options ] [, ...]);
The operator class identifies the operators to be used by the index for that column. For example, a B-tree index on the type int4 would use the int4_ops class; this operator class includes comparison functions for values of type int4.

In practice the default operator class for the column's data type is usually sufficient. The main reason for having operator classes is that for some data types, there could be more than one meaningful index behavior. For example, we might want to sort a complex-number data type either by absolute value or by real part. We could do this by defining two operator classes for the data type and then selecting the proper class when making an index. The operator class determines the basic sort ordering (which can then be modified by adding sort options COLLATE, ASC/DESC and/or NULLS FIRST/NULLS LAST).
There are also some built-in operator classes besides the default ones:
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard “C” locale. As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
Note that you should also create an index with the default operator class if you want queries involving ordinary <, <=, >, or >= comparisons to use an index. Such queries cannot use the xxx_pattern_ops operator classes. (Ordinary equality comparisons can use these operator classes, however.) It is possible to create multiple indexes on the same column with different operator classes.

If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.
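For instance, a hedged sketch of the two-index setup on the test_table column from the example above: one index with the pattern operator class for LIKE-style searches, and one with the default operator class for ordinary comparisons and sorting.
-- Serves LIKE 'abc%' and POSIX regular-expression searches.
CREATE INDEX test_index_pattern ON test_table (col varchar_pattern_ops);

-- Serves ordinary <, <=, >, >= comparisons and ORDER BY under the column's collation.
CREATE INDEX test_index_plain ON test_table (col);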
The following query shows all defined operator classes:
SELECT am.amname AS index_method,
       opc.opcname AS opclass_name,
       opc.opcintype::regtype AS indexed_type,
       opc.opcdefault AS is_default
    FROM pg_am am, pg_opclass opc
    WHERE opc.opcmethod = am.oid
    ORDER BY index_method, opclass_name;
An operator class is actually just a subset of a larger structure called an operator family. In cases where several data types have similar behaviors, it is frequently useful to define cross-data-type operators and allow these to work with indexes. To do this, the operator classes for each of the types must be grouped into the same operator family. The cross-type operators are members of the family, but are not associated with any single class within the family.
This expanded version of the previous query shows the operator family each operator class belongs to:
SELECT am.amname AS index_method,
       opc.opcname AS opclass_name,
       opf.opfname AS opfamily_name,
       opc.opcintype::regtype AS indexed_type,
       opc.opcdefault AS is_default
    FROM pg_am am, pg_opclass opc, pg_opfamily opf
    WHERE opc.opcmethod = am.oid AND opc.opcfamily = opf.oid
    ORDER BY index_method, opclass_name;
This query shows all defined operator families and all the operators included in each family:
SELECT am.amname AS index_method,
       opf.opfname AS opfamily_name,
       amop.amopopr::regoperator AS opfamily_operator
    FROM pg_am am, pg_opfamily opf, pg_amop amop
    WHERE opf.opfmethod = am.oid AND amop.amopfamily = opf.oid
    ORDER BY index_method, opfamily_name, opfamily_operator;
psql has commands \dAc, \dAf, and \dAo, which provide slightly more sophisticated versions of these queries.
An index can support only one collation per index column. If multiple collations are of interest, multiple indexes may be needed.
Consider these statements:
CREATE TABLE test1c (
    id integer,
    content varchar COLLATE "x"
);

CREATE INDEX test1c_content_index ON test1c (content);
The index automatically uses the collation of the underlying column. So a query of the form
SELECT * FROM test1c WHERE content > constant;
could use the index, because the comparison will by default use the collation of the column. However, this index cannot accelerate queries that involve some other collation. So if queries of the form, say,
SELECT * FROM test1c WHERE content > constant COLLATE "y";
are also of interest, an additional index could be created that supports the "y" collation, like this:
CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y");
Although indexes in PostgreSQL do not need maintenance or tuning, it is still important to check which indexes are actually used by the real-life query workload. Examining index usage for an individual query is done with the EXPLAIN command; its application for this purpose is illustrated in Section 14.1. It is also possible to gather overall statistics about index usage in a running server, as described in Section 28.2.
It is difficult to formulate a general procedure for determining which indexes to create. There are a number of typical cases that have been shown in the examples throughout the previous sections. A good deal of experimentation is often necessary. The rest of this section gives some tips for that:
Always run ANALYZE first. This command collects statistics about the distribution of the values in the table. This information is required to estimate the number of rows returned by a query, which is needed by the planner to assign realistic costs to each possible query plan. In the absence of any real statistics, some default values are assumed, which are almost certain to be inaccurate. Examining an application's index usage without having run ANALYZE is therefore a lost cause. See Section 25.1.3 and Section 25.1.6 for more information.
Use real data for experimentation. Using test data for setting up indexes will tell you what indexes you need for the test data, but that is all.
It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
Also be careful when making up test data, which is often unavoidable when the application is not yet in production. Values that are very similar, completely random, or inserted in sorted order will skew the statistics away from the distribution that real data would have.
When indexes are not used, it can be useful for testing to force their use. There are run-time parameters that can turn off various plan types (see Section 20.7.1). For instance, turning off sequential scans (enable_seqscan) and nested-loop joins (enable_nestloop), which are the most basic plans, will force the system to use a different plan. If the system still chooses a sequential scan or nested-loop join then there is probably a more fundamental reason why the index is not being used; for example, the query condition does not match the index. (What kind of query can use what kind of index is explained in the previous sections.)
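A minimal sketch of that technique, reusing the orders table from Example 11.2 (any query of interest can stand in for the SELECT); the settings affect only the current session and are easy to undo:
SET enable_seqscan = off;
SET enable_nestloop = off;

-- See whether the planner now picks an index scan for the query under test.
EXPLAIN ANALYZE SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;

RESET enable_seqscan;
RESET enable_nestloop;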
If forcing index usage does use the index, then there are two possibilities: either the system is right and using the index is indeed not appropriate, or the cost estimates of the query plans are not reflecting reality. So you should time your query with and without indexes. The EXPLAIN ANALYZE command can be useful here.
If it turns out that the cost estimates are wrong, there are, again, two possibilities. The total cost is computed from the per-row costs of each plan node times the selectivity estimate of the plan node. The costs estimated for the plan nodes can be adjusted via run-time parameters (described in Section 20.7.2). An inaccurate selectivity estimate is due to insufficient statistics. It might be possible to improve this by tuning the statistics-gathering parameters (see ALTER TABLE).
If you do not succeed in adjusting the costs to be more appropriate, then you might have to resort to forcing index usage explicitly. You might also want to contact the PostgreSQL developers to examine the issue.
Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document.
Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack many essential properties required by modern information systems:
There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g., satisfies and satisfy. You might miss documents that contain satisfies, although you probably would like to find them when searching for satisfy. It is possible to use OR to search for multiple derived forms, but this is tedious and error-prone (some words can have several thousand derivatives).
They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
They tend to be slow because there is no index support, so they must process all documents for every search.
Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. In principle token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes. PostgreSQL uses a parser to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English). This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
Storing preprocessed documents optimized for searching. For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes it is often desirable to store positional information to use for proximity ranking, so that a document that contains a more “dense” region of query words is assigned a higher rank than one with scattered query words.
Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:
Define stop words that should not be indexed.
Map synonyms to a single word using Ispell.
Map phrases to a single word using a thesaurus.
Map different variations of a word to a canonical form using an Ispell dictionary.
Map different variations of a word to a canonical form using Snowball stemmer rules.
A data type tsvector is provided for storing preprocessed documents, along with a type tsquery for representing processed queries (Section 8.11). There are many functions and operators available for these data types (Section 9.13), the most important of which is the match operator @@, which we introduce in Section 12.1.2. Full text searches can be accelerated using indexes (Section 12.9).
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example:
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE m.mid = d.did AND m.mid = 12;
Actually, in these example queries, coalesce should be used to prevent a single NULL attribute from causing a NULL result for the whole document.
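For example, a minimal sketch of the first query above with coalesce wrapped around each field (which fields can actually be null depends on your schema):
SELECT coalesce(title, '')    || ' ' || coalesce(author, '') || ' ' ||
       coalesce(abstract, '') || ' ' || coalesce(body, '')   AS document
FROM messages
WHERE mid = 12;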
Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside PostgreSQL. Also, keeping everything inside the database allows easy access to document metadata to assist in indexing and display.
For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document; the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the tsvector as being the document, but of course it is only a compact representation of the full document.
Full text searching in PostgreSQL is based on the match operator @@, which returns true if a tsvector (document) matches a tsquery (query). It doesn't matter which data type is written first:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
As the above example suggests, a tsquery is not just raw text, any more than a tsvector is. A tsquery contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. (For syntax details see Section 8.11.2.) There are functions to_tsquery, plainto_tsquery, and phraseto_tsquery that are helpful in converting user-written text into a proper tsquery, primarily by normalizing words appearing in the text. Similarly, to_tsvector is used to parse and normalize a document string. So in practice a text search match would look more like this:
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
 ?column?
----------
 t
Observe that this match would not succeed if written as
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat');
 ?column?
----------
 f
since here no normalization of the word rats will occur. The elements of a tsvector are lexemes, which are assumed already normalized, so rats does not match rat.
The @@ operator also supports text input, allowing explicit conversion of a text string to tsvector or tsquery to be skipped in simple cases. The variants available are:
tsvector @@ tsquery
tsquery  @@ tsvector
text     @@ tsquery
text     @@ text
The first two of these we saw already. The form text @@ tsquery is equivalent to to_tsvector(x) @@ y. The form text @@ text is equivalent to to_tsvector(x) @@ plainto_tsquery(y).
Within a tsquery, the & (AND) operator specifies that both its arguments must appear in the document to have a match. Similarly, the | (OR) operator specifies that at least one of its arguments must appear, while the ! (NOT) operator specifies that its argument must not appear in order to have a match. For example, the query fat & ! rat matches documents that contain fat but not rat.
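As a minimal illustration of the AND and NOT operators together (the sample sentence is made up for this sketch):
SELECT to_tsvector('english', 'fat cats chased thin dogs')
       @@ to_tsquery('english', 'fat & ! rat');
 ?column?
----------
 t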
Searching for phrases is possible with the help of the <-> (FOLLOWED BY) tsquery operator, which matches only if its arguments have matches that are adjacent and in the given order. For example:
SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');
 ?column?
----------
 t

SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error');
 ?column?
----------
 f
There is a more general version of the FOLLOWED BY operator having the form <N>, where N is an integer standing for the difference between the positions of the matching lexemes. <1> is the same as <->, while <2> allows exactly one other lexeme to appear between the matches, and so on. The phraseto_tsquery function makes use of this operator to construct a tsquery that can match a multi-word phrase when some of the words are stop words. For example:
SELECT phraseto_tsquery('cats ate rats');
       phraseto_tsquery
-------------------------------
 'cat' <-> 'ate' <-> 'rat'

SELECT phraseto_tsquery('the cats ate the rats');
       phraseto_tsquery
-------------------------------
 'cat' <-> 'ate' <2> 'rat'
A special case that's sometimes useful is that <0>
can be used to require that two patterns match the same word.
Parentheses can be used to control nesting of the tsquery operators. Without parentheses, | binds least tightly, then &, then <->, and ! most tightly.
It's worth noticing that the AND/OR/NOT operators mean something subtly different when they are within the arguments of a FOLLOWED BY operator than when they are not, because within FOLLOWED BY the exact position of the match is significant. For example, normally !x matches only documents that do not contain x anywhere. But !x <-> y matches y if it is not immediately after an x; an occurrence of x elsewhere in the document does not prevent a match. Another example is that x & y normally only requires that x and y both appear somewhere in the document, but (x & y) <-> z requires x and y to match at the same place, immediately before a z. Thus this query behaves differently from x <-> z & y <-> z, which will match a document containing two separate sequences x z and y z. (This specific query is useless as written, since x and y could not match at the same place; but with more complex situations such as prefix-match patterns, a query of this form could be useful.)
The above are all simple text search examples. As mentioned before, full text search functionality includes the ability to do many more things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing, e.g., parse based on more than just white space. This functionality is controlled by text search configurations. PostgreSQL comes with predefined configurations for many languages, and you can easily create your own configurations. (psql's \dF command shows all available configurations.)
During installation an appropriate configuration is selected and default_text_search_config is set accordingly in postgresql.conf. If you are using the same text search configuration for the entire cluster you can use the value in postgresql.conf. To use different configurations throughout the cluster but the same configuration within any one database, use ALTER DATABASE ... SET. Otherwise, you can set default_text_search_config in each session.
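A minimal sketch of the per-database and per-session options (the database name is illustrative):
-- Per database: every new session connected to mydb gets this configuration.
ALTER DATABASE mydb SET default_text_search_config = 'pg_catalog.english';

-- Per session: affects only the current session.
SET default_text_search_config = 'pg_catalog.english';

SHOW default_text_search_config;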
Each text search function that depends on a configuration has an optional regconfig argument, so that the configuration to use can be specified explicitly. default_text_search_config is used only when this argument is omitted.
To make it easier to build custom text search configurations, a configuration is built up from simpler database objects. PostgreSQL's text search facility provides four types of configuration-related database objects:
Text search parsers break documents into tokens and classify each token (for example, as words or numbers).
Text search dictionaries convert tokens to normalized form and reject stop words.
Text search templates provide the functions underlying dictionaries. (A dictionary simply specifies a template and a set of parameters for the template.)
Text search configurations select a parser and a set of dictionaries to use to normalize the tokens produced by the parser.
Text search parsers and templates are built from low-level C functions; therefore it requires C programming ability to develop new ones, and superuser privileges to install one into a database. (There are examples of add-on parsers and templates in the contrib/ area of the PostgreSQL distribution.) Since dictionaries and configurations just parameterize and connect together some underlying parsers and templates, no special privilege is needed to create a new dictionary or configuration. Examples of creating custom dictionaries and configurations appear later in this chapter.
The examples in the previous section illustrated full text matching using simple constant strings. This section shows how to search table data, optionally using indexes.
It is possible to do a full text search without an index. A simple query to print the title of each row that contains the word friend in its body field is:
SELECT title FROM pgweb WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
This will also find related words such as friends and friendly, since all these are reduced to the same normalized lexeme.
The query above specifies that the english configuration is to be used to parse and normalize the strings. Alternatively we could omit the configuration parameters:
SELECT title FROM pgweb WHERE to_tsvector(body) @@ to_tsquery('friend');
This query will use the configuration set by default_text_search_config.
A more complex example is to select the ten most recent documents that contain create and table in the title or body:
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
For clarity we omitted the coalesce function calls which would be needed to find rows that contain NULL in one of the two fields.
Although these queries will work without an index, most applications will find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index.
We can create a GIN index (Section 12.9) to speed up text searches:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
Notice that the 2-argument version of to_tsvector is used. Only text search functions that specify a configuration name can be used in expression indexes (Section 11.7). This is because the index contents must be unaffected by default_text_search_config. If they were affected, the index contents might be inconsistent because different entries could contain tsvectors that were created with different text search configurations, and there would be no way to guess which was which. It would be impossible to dump and restore such an index correctly.
Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index. That is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an index will be used only with the same configuration used to create the index entries.
It is possible to set up more complex expression indexes wherein the configuration name is specified by another column, e.g.:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body));
where config_name is a column in the pgweb table. This allows mixed configurations in the same index while recording which configuration was used for each index entry. This would be useful, for example, if the document collection contained documents in different languages. Again, queries that are meant to use the index must be phrased to match, e.g., WHERE to_tsvector(config_name, body) @@ 'a & b'.
Indexes can even concatenate columns:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body));
Another approach is to create a separate tsvector column to hold the output of to_tsvector. To keep this column automatically up to date with its source data, use a stored generated column. This example is a concatenation of title and body, using coalesce to ensure that one field will still be indexed when the other is NULL:
ALTER TABLE pgweb
    ADD COLUMN textsearchable_index_col tsvector
        GENERATED ALWAYS AS (to_tsvector('english',
            coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
Then we create a GIN index to speed up the search:
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);
Now we are ready to perform a fast full text search:
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
One advantage of the separate-column approach over an expression index is that it is not necessary to explicitly specify the text search configuration in queries in order to make use of the index. As shown in the example above, the query can depend on default_text_search_config. Another advantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches. (This is more important when using a GiST index than a GIN index; see Section 12.9.) The expression-index approach is simpler to set up, however, and it requires less disk space since the tsvector representation is not stored explicitly.
To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It's also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.
PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type.

to_tsvector([ config regconfig, ] document text) returns tsvector
to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration.
Here is a simple example:
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                     to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
In the example above we see that the resulting tsvector does not contain the words a, on, or it, the word rats became rat, and the punctuation sign - was ignored.
The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example, rats became rat because one of the dictionaries recognized that the word rats is a plural form of rat. Some words are recognized as stop words (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful in searching. In our example these are a, on, and it. If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign - because there are in fact no dictionaries assigned for its token type (Space symbols), meaning space tokens will never be indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration (Section 12.7). It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration english for the English language.
The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking of search results.
Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null. Here is the recommended method for creating a tsvector from a structured document:
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title, '')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword, '')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract, '')), 'C') ||
    setweight(to_tsvector(coalesce(body, '')), 'D');
Here we have used setweight to label the source of each lexeme in the finished tsvector, and then merged the labeled tsvector values using the tsvector concatenation operator ||. (Section 12.4.1 gives details about these operations.)
PostgreSQL provides the functions to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than either plainto_tsquery or phraseto_tsquery, but it is less forgiving about its input. websearch_to_tsquery is a simplified version of to_tsquery with an alternative syntax, similar to the one used by web search engines.
to_tsquery([ config regconfig, ] querytext text) returns tsquery
to_tsquery creates a tsquery value from querytext, which must consist of single tokens separated by the tsquery operators & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY), possibly grouped using parentheses. In other words, the input to to_tsquery must already follow the general rules for tsquery input, as described in Section 8.11.2. The difference is that while basic tsquery input takes the tokens at face value, to_tsquery normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration. For example:
SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery
---------------
 'fat' & 'rat'
As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:
SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB
Also, * can be attached to a lexeme to specify prefix matching:
SELECT to_tsquery('supern:*A & star:A*B');
        to_tsquery
--------------------------
 'supern':*A & 'star':*AB
Such a lexeme will match any word in a tsvector that begins with the given string.
to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule supernovae stars : sn:
SELECT to_tsquery('''supernovae stars'' & !crab');
   to_tsquery
----------------
 'sn' & !'crab'
Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.
plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
plainto_tsquery transforms the unformatted text querytext to a tsquery value. The text is parsed and normalized much as for to_tsvector, then the & (AND) tsquery operator is inserted between surviving words. Example:
SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'
Note that plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input:
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
   plainto_tsquery
---------------------
 'fat' & 'rat' & 'c'
Here, all the input punctuation was discarded.
phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery
phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <-> (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order not just the presence of all the lexemes. Example:
SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' <-> 'rat'
Like plainto_tsquery, the phraseto_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input:
SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' <-> 'rat' <-> 'c'
websearch_to_tsquery([ config regconfig, ] querytext text) returns tsquery
websearch_to_tsquery creates a tsquery value from querytext using an alternative syntax in which simple unformatted text is a valid query. Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain operators. Moreover, this function will never raise syntax errors, which makes it possible to use raw user-supplied input for search. The following syntax is supported:
unquoted text: text not inside quote marks will be converted to terms separated by & operators, as if processed by plainto_tsquery.
"quoted text": text inside quote marks will be converted to terms separated by <-> operators, as if processed by phraseto_tsquery.
OR: the word “or” will be converted to the | operator.
-: a dash will be converted to the ! operator.
Other punctuation is ignored. So like plainto_tsquery and phraseto_tsquery, the websearch_to_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input.
Examples:
SELECT websearch_to_tsquery('english', 'The fat rats');
 websearch_to_tsquery
----------------------
 'fat' & 'rat'
(1 row)

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
       websearch_to_tsquery
----------------------------------
 'supernova' <-> 'star' & !'crab'
(1 row)

SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
       websearch_to_tsquery
-----------------------------------
 'sad' <-> 'cat' | 'fat' <-> 'rat'
(1 row)

SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
         websearch_to_tsquery
---------------------------------------
 'signal' & !( 'segment' <-> 'fault' )
(1 row)

SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->');
 websearch_to_tsquery
----------------------
 'dummi' & 'queri'
(1 row)
Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
The two ranking functions currently available are:
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
Ranks vectors based on the frequency of their matching lexemes.
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.

This function requires lexeme positional information to perform its calculation. Therefore, it ignores any “stripped” lexemes in the tsvector. If there are no unstripped lexemes in the input, the result will be zero. (See Section 12.4.1 for more information about the strip function and positional information in tsvectors.)
For both these functions, the optional weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:
{D-weight, C-weight, B-weight, A-weight}
If no weights are provided, then these defaults are used:
{0.1, 0.2, 0.4, 1.0}
Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
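For instance, here is a hedged sketch, reusing the apod table and textsearch column from the ranking example further below, that gives A-labeled (title) lexemes full weight while giving D-labeled (body) lexemes no weight at all; remember the array is ordered {D, C, B, A}, and the exact values are illustrative.
SELECT title, ts_rank('{0.0, 0.2, 0.4, 1.0}', textsearch, query) AS rank
FROM apod, to_tsquery('neutrino') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;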
Since a longer document has a greater chance of containing a query term it is reasonable to take into account document size, e.g., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer normalization option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).
0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1
If more than one flag bit is specified, the transformations are applied in the order listed.
It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.
Here is an example that selects only the ten highest-ranked matches:
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
This is the same example using normalized ranking:
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.
ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. The configuration to be used to parse the document can be specified by config; if config is omitted, the default_text_search_config configuration is used.

If an options string is specified it must consist of a comma-separated list of one or more option=value pairs. The available options are:
MaxWords, MinWords (integers): these numbers determine the longest and shortest headlines to output. The default values are 35 and 15.
ShortWord (integer): words of this length or less will be dropped at the start and end of a headline, unless they are query terms. The default value of three eliminates common English articles.
HighlightAll (boolean): if true the whole document will be used as the headline, ignoring the preceding three parameters. The default is false.
MaxFragments (integer): maximum number of text fragments to display. The default value of zero selects a non-fragment-based headline generation method. A value greater than zero selects fragment-based headline generation (see below).
StartSel, StopSel (strings): the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. The default values are “<b>” and “</b>”, which can be suitable for HTML output.
FragmentDelimiter (string): When more than one fragment is displayed, the fragments will be separated by this string. The default is “ ... ”.
These option names are recognized case-insensitively. You must double-quote string values if they contain spaces or commas.
In non-fragment-based headline generation, ts_headline locates matches for the given query and chooses a single one to display, preferring matches that have more query words within the allowed headline length.
In fragment-based headline generation, ts_headline locates the query matches and splits each match into “fragments” of no more than MaxWords words each, preferring fragments with more query words, and when possible “stretching” fragments to include surrounding words. The fragment-based mode is thus more useful when the query matches span large sections of the document, or when it's desirable to display multiple matches.
In either mode, if no query matches can be identified, then a single
fragment of the first MinWords
words in the document
will be displayed.
For example:
SELECT ts_headline('english', 'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.', to_tsquery('english', 'query & similarity')); ts_headline ------------------------------------------------------------ containing given <b>query</b> terms + and return them in order of their <b>similarity</b> to the+ <b>query</b>. SELECT ts_headline('english', 'Search terms may occur many times in a document, requiring ranking of the search matches to decide which occurrences to display in the result.', to_tsquery('english', 'search & term'), 'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<, StopSel=>>'); ts_headline ------------------------------------------------------------ <<Search>> <<terms>> may occur + many times ... ranking of the <<search>> matches to decide
ts_headline
uses the original document, not a
tsvector
summary, so it can be slow and should be used with
care.
This section describes additional functions and operators that are useful in connection with text search.
Section 12.3.1 showed how raw textual
documents can be converted into tsvector
values.
PostgreSQL also provides functions and
operators that can be used to manipulate documents that are already
in tsvector
form.
tsvector
|| tsvector
The tsvector
concatenation operator
returns a vector which combines the lexemes and positional information
of the two vectors given as arguments. Positions and weight labels
are retained during the concatenation.
Positions appearing in the right-hand vector are offset by the largest
position mentioned in the left-hand vector, so that the result is
nearly equivalent to the result of performing to_tsvector
on the concatenation of the two original document strings. (The
equivalence is not exact, because any stop-words removed from the
end of the left-hand argument will not affect the result, whereas
they would have affected the positions of the lexemes in the
right-hand argument if textual concatenation were used.)
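As a minimal illustration of the position offsetting just described (the literal tsvector values here are invented for the example):

SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
         ?column?
---------------------------
 'a':1 'b':2,5 'c':3 'd':4

The right-hand positions are offset by 2, the largest position in the left-hand vector, so 'b' ends up with positions 2 and 5.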
One advantage of using concatenation in the vector form, rather than
concatenating text before applying to_tsvector
, is that
you can use different configurations to parse different sections
of the document. Also, because the setweight
function
marks all lexemes of the given vector the same way, it is necessary
to parse the text and do setweight
before concatenating
if you want to label different parts of the document with different
weights.
setweight(vector tsvector, weight "char") returns tsvector
setweight
returns a copy of the input vector in which every
position has been labeled with the given weight
, either
A
, B
, C
, or
D
. (D
is the default for new
vectors and as such is not displayed on output.) These labels are
retained when vectors are concatenated, allowing words from different
parts of a document to be weighted differently by ranking functions.
Note that weight labels apply to positions, not
lexemes. If the input vector has been stripped of
positions then setweight
does nothing.
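For example, using an invented literal value, every position is labeled with the requested weight:

SELECT setweight('fat:2,4 cat:3 rat:5A'::tsvector, 'A');
           setweight
-------------------------------
 'cat':3A 'fat':2A,4A 'rat':5A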
length(vector tsvector) returns integer
Returns the number of lexemes stored in the vector.
strip(vector tsvector) returns tsvector
Returns a vector that lists the same lexemes as the given vector, but
lacks any position or weight information. The result is usually much
smaller than an unstripped vector, but it is also less useful.
Relevance ranking does not work as well on stripped vectors as
unstripped ones. Also,
the <->
(FOLLOWED BY) tsquery
operator
will never match stripped input, since it cannot determine the
distance between lexeme occurrences.
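For example, here are the two functions just described applied to the same invented value:

SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);
 length
--------
      3

SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
       strip
---------------------
 'cat' 'fat' 'rat'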
A full list of tsvector
-related functions is available
in Table 9.42.
Section 12.3.2 showed how raw textual
queries can be converted into tsquery
values.
PostgreSQL also provides functions and
operators that can be used to manipulate queries that are already
in tsquery
form.
tsquery
&& tsquery
Returns the AND-combination of the two given queries.
tsquery
|| tsquery
Returns the OR-combination of the two given queries.
!! tsquery
Returns the negation (NOT) of the given query.
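For example, the boolean tsquery operators described above can be applied directly to tsquery values (the literal queries here are arbitrary):

SELECT 'fat | rat'::tsquery && 'cat'::tsquery;
          ?column?
---------------------------
 ( 'fat' | 'rat' ) & 'cat'

SELECT 'fat | rat'::tsquery || 'cat'::tsquery;
        ?column?
------------------------
 'fat' | 'rat' | 'cat'

SELECT !! 'cat'::tsquery;
 ?column?
----------
 !'cat'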
tsquery
<-> tsquery
Returns a query that searches for a match to the first given query
immediately followed by a match to the second given query, using
the <->
(FOLLOWED BY)
tsquery
operator. For example:
SELECT to_tsquery('fat') <-> to_tsquery('cat | rat'); ?column? ---------------------------- 'fat' <-> ( 'cat' | 'rat' )
tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery
Returns a query that searches for a match to the first given query
followed by a match to the second given query at a distance of exactly
distance
lexemes, using
the <N> tsquery operator. For example:
SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10); tsquery_phrase ------------------ 'fat' <10> 'cat'
numnode(query tsquery) returns integer
Returns the number of nodes (lexemes plus operators) in a
tsquery
. This function is useful
to determine if the query
is meaningful
(returns > 0), or contains only stop words (returns 0).
Examples:
SELECT numnode(plainto_tsquery('the any')); NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored numnode --------- 0 SELECT numnode('foo & bar'::tsquery); numnode --------- 3
querytree(query tsquery) returns text
Returns the portion of a tsquery
that can be used for
searching an index. This function is useful for detecting
unindexable queries, for example those containing only stop words
or only negated terms. For example:
SELECT querytree(to_tsquery('defined')); querytree ----------- 'defin' SELECT querytree(to_tsquery('!defined')); querytree ----------- T
The ts_rewrite
family of functions search a
given tsquery
for occurrences of a target
subquery, and replace each occurrence with a
substitute subquery. In essence this operation is a
tsquery
-specific version of substring replacement.
A target and substitute combination can be
thought of as a query rewrite rule. A collection
of such rewrite rules can be a powerful search aid.
For example, you can expand the search using synonyms
(e.g., new york
, big apple
, nyc
,
gotham
) or narrow the search to direct the user to some hot
topic. There is some overlap in functionality between this feature
and thesaurus dictionaries (Section 12.6.4).
However, you can modify a set of rewrite rules on-the-fly without
reindexing, whereas updating a thesaurus requires reindexing to be
effective.
ts_rewrite(query tsquery, target tsquery, substitute tsquery) returns tsquery
This form of ts_rewrite
simply applies a single
rewrite rule: target
is replaced by substitute
wherever it appears in query
. For example:
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery); ts_rewrite ------------ 'b' & 'c'
ts_rewrite(query tsquery, select text) returns tsquery
This form of ts_rewrite
accepts a starting
query
and an SQL select
command, which
is given as a text string. The select
must yield two
columns of tsquery
type. For each row of the
select
result, occurrences of the first column value
(the target) are replaced by the second column value (the substitute)
within the current query
value. For example:
CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery); INSERT INTO aliases VALUES('a', 'c'); SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases'); ts_rewrite ------------ 'b' & 'c'
Note that when multiple rewrite rules are applied in this way,
the order of application can be important; so in practice you will
want the source query to ORDER BY
some ordering key.
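For instance, the single-rule example above could be written with an explicit ordering key like this (a sketch only; with a single rule the result is unchanged):

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t, s FROM aliases ORDER BY t');
 ts_rewrite
------------
 'b' & 'c'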
Let's consider a real-life astronomical example. We'll expand query
supernovae
using table-driven rewriting rules:
CREATE TABLE aliases (t tsquery primary key, s tsquery); INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn')); SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases'); ts_rewrite --------------------------------- 'crab' & ( 'supernova' | 'sn' )
We can change the rewriting rules just by updating the table:
UPDATE aliases SET s = to_tsquery('supernovae|sn & !nebulae') WHERE t = to_tsquery('supernovae'); SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases'); ts_rewrite --------------------------------------------- 'crab' & ( 'supernova' | 'sn' & !'nebula' )
Rewriting can be slow when there are many rewriting rules, since it
checks every rule for a possible match. To filter out obvious non-candidate
rules we can use the containment operators for the tsquery
type. In the example below, we select only those rules which might match
the original query:
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t'); ts_rewrite ------------ 'b' & 'c'
The method described in this section has been obsoleted by the use of stored generated columns, as described in Section 12.2.2.
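For comparison, a minimal sketch of the generated-column alternative to the trigger setup shown below (same table and column names as the trigger example; the exact expression is an assumption about how you want to combine the columns):

CREATE TABLE messages (
    title text,
    body  text,
    tsv   tsvector GENERATED ALWAYS AS
          (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED
);

Note that the expression must name the configuration explicitly so that it is immutable.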
When using a separate column to store the tsvector
representation
of your documents, it is necessary to create a trigger to update the
tsvector
column when the document content columns change.
Two built-in trigger functions are available for this, or you can write
your own.
tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])
These trigger functions automatically compute a tsvector
column from one or more textual columns, under the control of
parameters specified in the CREATE TRIGGER
command.
An example of their use is:
CREATE TABLE messages ( title text, body text, tsv tsvector ); CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages FOR EACH ROW EXECUTE FUNCTION tsvector_update_trigger(tsv, 'pg_catalog.english', title, body); INSERT INTO messages VALUES('title here', 'the body text is here'); SELECT * FROM messages; title | body | tsv ------------+-----------------------+---------------------------- title here | the body text is here | 'bodi':4 'text':5 'titl':1 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body'); title | body ------------+----------------------- title here | the body text is here
Having created this trigger, any change in title
or
body
will automatically be reflected into
tsv
, without the application having to worry about it.
The first trigger argument must be the name of the tsvector
column to be updated. The second argument specifies the text search
configuration to be used to perform the conversion. For
tsvector_update_trigger
, the configuration name is simply
given as the second trigger argument. It must be schema-qualified as
shown above, so that the trigger behavior will not change with changes
in search_path
. For
tsvector_update_trigger_column
, the second trigger argument
is the name of another table column, which must be of type
regconfig
. This allows a per-row selection of configuration
to be made. The remaining argument(s) are the names of textual columns
(of type text
, varchar
, or char
). These
will be included in the document in the order given. NULL values will
be skipped (but the other columns will still be indexed).
A limitation of these built-in triggers is that they treat all the input columns alike. To process columns differently — for example, to weight title differently from body — it is necessary to write a custom trigger. Here is an example using PL/pgSQL as the trigger language:
CREATE FUNCTION messages_trigger() RETURNS trigger AS $$ begin new.tsv := setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') || setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D'); return new; end $$ LANGUAGE plpgsql; CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger();
Keep in mind that it is important to specify the configuration name
explicitly when creating tsvector
values inside triggers,
so that the column's contents will not be affected by changes to
default_text_search_config
. Failure to do this is likely to
lead to problems such as search results changing after a dump and restore.
The function ts_stat
is useful for checking your
configuration and for finding stop-word candidates.
ts_stat(sqlquery text, [ weights text, ] OUT word text, OUT ndoc integer, OUT nentry integer) returns setof record
sqlquery
is a text value containing an SQL
query which must return a single tsvector
column.
ts_stat
executes the query and returns statistics about
each distinct lexeme (word) contained in the tsvector
data. The columns returned are
word text — the value of a lexeme
ndoc integer — number of documents (tsvectors) the word occurred in
nentry integer — total number of occurrences of the word
If weights
is supplied, only occurrences
having one of those weights are counted.
For example, to find the ten most frequent words in a document collection:
SELECT * FROM ts_stat('SELECT vector FROM apod') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
The same, but counting only word occurrences with weight A
or B
:
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
Text search parsers are responsible for splitting raw document text into tokens and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all — it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications.
The built-in parser is named pg_catalog.default
.
It recognizes 23 token types, shown in Table 12.1.
Table 12.1. Default Parser's Token Types
Alias | Description | Example |
---|---|---|
asciiword | Word, all ASCII letters | elephant |
word | Word, all letters | mañana |
numword | Word, letters and digits | beta1 |
asciihword | Hyphenated word, all ASCII | up-to-date |
hword | Hyphenated word, all letters | lógico-matemática |
numhword | Hyphenated word, letters and digits | postgresql-beta1 |
hword_asciipart | Hyphenated word part, all ASCII | postgresql in the context postgresql-beta1 |
hword_part | Hyphenated word part, all letters | lógico or matemática in the context lógico-matemática |
hword_numpart | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1 |
email | Email address | foo@example.com |
protocol | Protocol head | http:// |
url | URL | example.com/stuff/index.html |
host | Host | example.com |
url_path | URL path | /stuff/index.html , in the context of a URL |
file | File or path name | /usr/local/foo.txt , if not within a URL |
sfloat | Scientific notation | -1.234e56 |
float | Decimal notation | -1.234 |
int | Signed integer | -1234 |
uint | Unsigned integer | 1234 |
version | Version number | 8.3.0 |
tag | XML tag | <a href="dictionaries.html"> |
entity | XML entity | & |
blank | Space symbols | (any whitespace or punctuation not otherwise recognized) |
The parser's notion of a “letter” is determined by the database's
locale setting, specifically lc_ctype
. Words containing
only the basic ASCII letters are reported as a separate token type,
since it is sometimes useful to distinguish them. In most European
languages, token types word
and asciiword
should be treated alike.
email
does not support all valid email characters as
defined by RFC 5322.
Specifically, the only non-alphanumeric characters supported for
email user names are period, dash, and underscore.
It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1'); alias | description | token -----------------+------------------------------------------+--------------- numhword | Hyphenated word, letters and digits | foo-bar-beta1 hword_asciipart | Hyphenated word part, all ASCII | foo blank | Space symbols | - hword_asciipart | Hyphenated word part, all ASCII | bar blank | Space symbols | - hword_numpart | Hyphenated word part, letters and digits | beta1
This behavior is desirable since it allows searches to work for both the whole compound word and for components. Here is another instructive example:
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html'); alias | description | token ----------+---------------+------------------------------ protocol | Protocol head | http:// url | URL | example.com/stuff/index.html host | Host | example.com url_path | URL path | /stuff/index.html
Dictionaries are used to eliminate words that should not be considered in a
search (stop words), and to normalize words so
that different derived forms of the same word will match. A successfully
normalized word is called a lexeme. Aside from
improving search quality, normalization and removal of stop words reduce the
size of the tsvector
representation of a document, thereby
improving performance. Normalization does not always have linguistic meaning
and usually depends on application semantics.
Some examples of normalization:
Linguistic — Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
URL locations can be canonicalized to make equivalent URLs match:
http://www.pgsql.ru/db/mw/index.html
http://www.pgsql.ru/db/mw/
http://www.pgsql.ru/db/../db/mw/index.html
Color names can be replaced by their hexadecimal values, e.g.,
red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF
If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers, so for example 3.14159265359, 3.1415926, 3.14 will be the same after normalization if only two digits are kept after the decimal point.
A dictionary is a program that accepts a token as input and returns:
an array of lexemes if the input token is known to the dictionary (notice that one token can produce more than one lexeme)
a single lexeme with the TSL_FILTER
flag set, to replace
the original token with a new token to be passed to subsequent
dictionaries (a dictionary that does this is called a
filtering dictionary)
an empty array if the dictionary knows the token, but it is a stop word
NULL
if the dictionary does not recognize the input token
PostgreSQL provides predefined dictionaries for
many languages. There are also several predefined templates that can be
used to create new dictionaries with custom parameters. Each predefined
dictionary template is described below. If no existing
template is suitable, it is possible to create new ones; see the
contrib/
area of the PostgreSQL distribution
for examples.
A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token
type that the parser can return, a separate list of dictionaries is
specified by the configuration. When a token of that type is found
by the parser, each dictionary in the list is consulted in turn,
until some dictionary recognizes it as a known word. If it is identified
as a stop word, or if no dictionary recognizes the token, it will be
discarded and not indexed or searched for.
Normally, the first dictionary that returns a non-NULL
output determines the result, and any remaining dictionaries are not
consulted; but a filtering dictionary can replace the given word
with a modified word, which is then passed to subsequent dictionaries.
The general rule for configuring a list of dictionaries
is to place first the most narrow, most specific dictionary, then the more
general dictionaries, finishing with a very general dictionary, like
a Snowball stemmer or simple
, which
recognizes everything. For example, for an astronomy-specific search
(astro_en
configuration) one could bind token type
asciiword
(ASCII word) to a synonym dictionary of astronomical
terms, a general English dictionary and a Snowball English
stemmer:
ALTER TEXT SEARCH CONFIGURATION astro_en ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
A filtering dictionary can be placed anywhere in the list, except at the end where it'd be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries. For example, a filtering dictionary could be used to remove accents from accented letters, as is done by the unaccent module.
Stop words are words that are very common, appear in almost every
document, and have no discrimination value. Therefore, they can be ignored
in the context of full text searching. For example, every English text
contains words like a
and the
, so it is
useless to store them in an index. However, stop words do affect the
positions in tsvector
, which in turn affect ranking:
SELECT to_tsvector('english', 'in the list of stop words'); to_tsvector ---------------------------- 'list':3 'stop':5 'word':6
The missing positions 1,2,4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop')); ts_rank_cd ------------ 0.05 SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop')); ts_rank_cd ------------ 0.1
It is up to the specific dictionary how it treats stop words. For example,
ispell
dictionaries first normalize words and then
look at the list of stop words, while Snowball
stemmers
first check the list of stop words. The reason for the different
behavior is an attempt to decrease noise.
The simple
dictionary template operates by converting the
input token to lower case and checking it against a file of stop words.
If it is found in the file then an empty array is returned, causing
the token to be discarded. If not, the lower-cased form of the word
is returned as the normalized lexeme. Alternatively, the dictionary
can be configured to report non-stop-words as unrecognized, allowing
them to be passed on to the next dictionary in the list.
Here is an example of a dictionary definition using the simple
template:
CREATE TEXT SEARCH DICTIONARY public.simple_dict ( TEMPLATE = pg_catalog.simple, STOPWORDS = english );
Here, english
is the base name of a file of stop words.
The file's full name will be
$SHAREDIR/tsearch_data/english.stop
,
where $SHAREDIR
means the
PostgreSQL installation's shared-data directory,
often /usr/local/share/postgresql
(use pg_config
--sharedir
to determine it if you're not sure).
The file format is simply a list
of words, one per line. Blank lines and trailing spaces are ignored,
and upper case is folded to lower case, but no other processing is done
on the file contents.
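The first few lines of such a file might look like this (an illustrative excerpt, not the complete list shipped with PostgreSQL):

i
me
my
myself
we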
Now we can test our dictionary:
SELECT ts_lexize('public.simple_dict', 'YeS'); ts_lexize ----------- {yes} SELECT ts_lexize('public.simple_dict', 'The'); ts_lexize ----------- {}
We can also choose to return NULL
, instead of the lower-cased
word, if it is not found in the stop words file. This behavior is
selected by setting the dictionary's Accept
parameter to
false
. Continuing the example:
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false ); SELECT ts_lexize('public.simple_dict', 'YeS'); ts_lexize ----------- SELECT ts_lexize('public.simple_dict', 'The'); ts_lexize ----------- {}
With the default setting of Accept
= true
,
it is only useful to place a simple
dictionary at the end
of a list of dictionaries, since it will never pass on any token to
a following dictionary. Conversely, Accept
= false
is only useful when there is at least one following dictionary.
Most types of dictionaries rely on configuration files, such as files of stop words. These files must be stored in UTF-8 encoding. They will be translated to the actual database encoding, if that is different, when they are read into the server.
Normally, a database session will read a dictionary configuration file
only once, when it is first used within the session. If you modify a
configuration file and want to force existing sessions to pick up the
new contents, issue an ALTER TEXT SEARCH DICTIONARY
command
on the dictionary. This can be a “dummy” update that doesn't
actually change any parameter values.
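For example, re-specifying a parameter's existing value is sufficient; this sketch assumes the simple_dict dictionary defined earlier in this section:

ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );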
This dictionary template is used to create dictionaries that replace a
word with a synonym. Phrases are not supported (use the thesaurus
template (Section 12.6.4) for that). A synonym
dictionary can be used to overcome linguistic problems, for example, to
prevent an English stemmer dictionary from reducing the word “Paris” to
“pari”. It is enough to have a Paris paris
line in the
synonym dictionary and put it before the english_stem
dictionary. For example:
SELECT * FROM ts_debug('english', 'Paris'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-------+----------------+--------------+--------- asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari} CREATE TEXT SEARCH DICTIONARY my_synonym ( TEMPLATE = synonym, SYNONYMS = my_synonyms ); ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR asciiword WITH my_synonym, english_stem; SELECT * FROM ts_debug('english', 'Paris'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-------+---------------------------+------------+--------- asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
The only parameter required by the synonym
template is
SYNONYMS
, which is the base name of its configuration file
— my_synonyms
in the above example.
The file's full name will be
$SHAREDIR/tsearch_data/my_synonyms.syn
(where $SHAREDIR
means the
PostgreSQL installation's shared-data directory).
The file format is just one line
per word to be substituted, with the word followed by its synonym,
separated by white space. Blank lines and trailing spaces are ignored.
The synonym
template also has an optional parameter
CaseSensitive
, which defaults to false
. When
CaseSensitive
is false
, words in the synonym file
are folded to lower case, as are input tokens. When it is
true
, words and tokens are not folded to lower case,
but are compared as-is.
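For example, a case-sensitive variant could be declared like this (the dictionary name is arbitrary and the my_synonyms file is reused only for illustration):

CREATE TEXT SEARCH DICTIONARY my_cs_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);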
An asterisk (*
) can be placed at the end of a synonym
in the configuration file. This indicates that the synonym is a prefix.
The asterisk is ignored when the entry is used in
to_tsvector()
, but when it is used in
to_tsquery()
, the result will be a query item with
the prefix match marker (see
Section 12.3.2).
For example, suppose we have these entries in
$SHAREDIR/tsearch_data/synonym_sample.syn
:
postgres pgsql postgresql pgsql postgre pgsql gogle googl indices index*
Then we will get these results:
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample'); mydb=# SELECT ts_lexize('syn', 'indices'); ts_lexize ----------- {index} (1 row) mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple); mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn; mydb=# SELECT to_tsvector('tst', 'indices'); to_tsvector ------------- 'index':1 (1 row) mydb=# SELECT to_tsquery('tst', 'indices'); to_tsquery ------------ 'index':* (1 row) mydb=# SELECT 'indexes are very useful'::tsvector; tsvector --------------------------------- 'are' 'indexes' 'useful' 'very' (1 row) mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices'); ?column? ---------- t (1 row)
A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.
Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well. PostgreSQL's current implementation of the thesaurus dictionary is an extension of the synonym dictionary with added phrase support. A thesaurus dictionary requires a configuration file of the following format:
# this is a comment sample word(s) : indexed word(s) more sample word(s) : more indexed word(s) ...
where the colon (:
) symbol acts as a delimiter between a
phrase and its replacement.
A thesaurus dictionary uses a subdictionary (which
is specified in the dictionary's configuration) to normalize the input
text before checking for phrase matches. It is only possible to select one
subdictionary. An error is reported if the subdictionary fails to
recognize a word. In that case, you should remove the use of the word or
teach the subdictionary about it. You can place an asterisk
(*
) at the beginning of an indexed word to skip applying
the subdictionary to it, but all sample words must be known
to the subdictionary.
The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input, and ties are broken by using the last definition.
Specific stop words recognized by the subdictionary cannot be
specified; instead use ?
to mark the location where any
stop word can appear. For example, assuming that a
and
the
are stop words according to the subdictionary:
? one ? two : swsw
matches a one the two
and the one a two
;
both would be replaced by swsw
.
Since a thesaurus dictionary has the capability to recognize phrases it
must remember its state and interact with the parser. A thesaurus dictionary
uses these assignments to check if it should handle the next word or stop
accumulation. The thesaurus dictionary must be configured
carefully. For example, if the thesaurus dictionary is assigned to handle
only the asciiword
token, then a thesaurus dictionary
definition like one 7
will not work since token type
uint
is not assigned to the thesaurus dictionary.
Thesauruses are used during indexing, so any change in the thesaurus dictionary's parameters requires reindexing. For most other dictionary types, small changes such as adding or removing stop words do not require reindexing.
To define a new thesaurus dictionary, use the thesaurus
template. For example:
CREATE TEXT SEARCH DICTIONARY thesaurus_simple ( TEMPLATE = thesaurus, DictFile = mythesaurus, Dictionary = pg_catalog.english_stem );
Here:
thesaurus_simple
is the new dictionary's name
mythesaurus
is the base name of the thesaurus
configuration file.
(Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths
,
where $SHAREDIR
means the installation shared-data
directory.)
pg_catalog.english_stem
is the subdictionary (here,
a Snowball English stemmer) to use for thesaurus normalization.
Notice that the subdictionary will have its own
configuration (for example, stop words), which is not shown here.
Now it is possible to bind the thesaurus dictionary thesaurus_simple
to the desired token types in a configuration, for example:
ALTER TEXT SEARCH CONFIGURATION russian ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH thesaurus_simple;
Consider a simple astronomical thesaurus thesaurus_astro
,
which contains some astronomical word combinations:
supernovae stars : sn crab nebulae : crab
Below we create a dictionary and bind some token types to an astronomical thesaurus and English stemmer:
CREATE TEXT SEARCH DICTIONARY thesaurus_astro ( TEMPLATE = thesaurus, DictFile = thesaurus_astro, Dictionary = english_stem ); ALTER TEXT SEARCH CONFIGURATION russian ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH thesaurus_astro, english_stem;
Now we can see how it works.
ts_lexize
is not very useful for testing a thesaurus,
because it treats its input as a single token. Instead we can use
plainto_tsquery
and to_tsvector
which will break their input strings into multiple tokens:
SELECT plainto_tsquery('supernova star'); plainto_tsquery ----------------- 'sn' SELECT to_tsvector('supernova star'); to_tsvector ------------- 'sn':1
In principle, one can use to_tsquery
if you quote
the argument:
SELECT to_tsquery('''supernova star'''); to_tsquery ------------ 'sn'
Notice that supernova star
matches supernovae
stars
in thesaurus_astro
because we specified
the english_stem
stemmer in the thesaurus definition.
The stemmer removed the e
and s
.
To index the original phrase as well as the substitute, just include it in the right-hand part of the definition:
supernovae stars : sn supernovae stars SELECT plainto_tsquery('supernova star'); plainto_tsquery ----------------------------- 'sn' & 'supernova' & 'star'
The Ispell dictionary template supports
morphological dictionaries, which can normalize many
different linguistic forms of a word into the same lexeme. For example,
an English Ispell dictionary can match all declensions and
conjugations of the search term bank
, e.g.,
banking
, banked
, banks
,
banks'
, and bank's
.
The standard PostgreSQL distribution does not include any Ispell configuration files. Dictionaries for a large number of languages are available from Ispell. Also, some more modern dictionary file formats are supported — MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A large list of dictionaries is available on the OpenOffice Wiki.
To create an Ispell dictionary perform these steps:
download dictionary configuration files. OpenOffice
extension files have the .oxt
extension. It is necessary
to extract the .aff and .dic files and change their extensions to .affix and .dict. For some dictionary files it is also necessary to convert the characters to UTF-8 encoding, with commands such as these (for a Norwegian language dictionary, for example):
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
copy files to the $SHAREDIR/tsearch_data
directory
load files into PostgreSQL with the following command:
CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, DictFile = en_us, AffFile = en_us, Stopwords = english);
Here, DictFile
, AffFile
, and StopWords
specify the base names of the dictionary, affixes, and stop-words files.
The stop-words file has the same format explained above for the
simple
dictionary type. The format of the other files is
not specified here but is available from the above-mentioned web sites.
Ispell dictionaries usually recognize a limited set of words, so they should be followed by another broader dictionary; for example, a Snowball dictionary, which recognizes everything.
The .affix
file of Ispell has the following
structure:
prefixes flag *A: . > RE # As in enter > reenter suffixes flag T: E > ST # As in late > latest [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest [AEIOU]Y > EST # As in gray > grayest [^EY] > EST # As in small > smallest
And the .dict
file has the following structure:
lapse/ADGRS lard/DGRS large/PRTY lark/MRS
The format of the .dict file is:
basic_form/affix_class_name
In the .affix
file every affix flag is described in the
following format:
condition > [-stripping_letters,] adding_affix
Here, condition has a format similar to the format of regular expressions.
It can use groupings [...]
and [^...]
.
For example, [AEIOU]Y
means that the last letter of the word
is "y"
and the penultimate letter is "a"
,
"e"
, "i"
, "o"
or "u"
.
[^EY]
means that the last letter is neither "e"
nor "y"
.
Ispell dictionaries support splitting compound words;
a useful feature.
Notice that the affix file should specify a special flag using the
compoundwords controlled
statement that marks dictionary
words that can participate in compound formation:
compoundwords controlled z
Here are some examples for the Norwegian language:
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent'); {over,buljong,terning,pakk,mester,assistent} SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); {sjokoladefabrikk,sjokolade,fabrikk}
MySpell format is a subset of Hunspell.
The .affix
file of Hunspell has the following
structure:
PFX A Y 1 PFX A 0 re . SFX T N 4 SFX T 0 st e SFX T y iest [^aeiou]y SFX T 0 est [aeiou]y SFX T 0 est [^ey]
The first line of an affix class is the header. The fields of an affix rule are listed after the header:
parameter name (PFX or SFX)
flag (name of the affix class)
stripping characters from beginning (at prefix) or end (at suffix) of the word
adding affix
condition that has a format similar to the format of regular expressions.
The .dict
file looks like the .dict
file of
Ispell:
larder/M lardy/RT large/RSPMYT largehearted
MySpell does not support compound words. Hunspell has sophisticated support for compound words. At present, PostgreSQL implements only the basic compound word operations of Hunspell.
The Snowball dictionary template is based on a project
by Martin Porter, inventor of the popular Porter's stemming algorithm
for the English language. Snowball now provides stemming algorithms for
many languages (see the Snowball
site for more information). Each algorithm understands how to
reduce common variant forms of words to a base, or stem, spelling within
its language. A Snowball dictionary requires a language
parameter to identify which stemmer to use, and optionally can specify a
stopword
file name that gives a list of words to eliminate.
(PostgreSQL's standard stopword lists are also
provided by the Snowball project.)
For example, there is a built-in definition equivalent to
CREATE TEXT SEARCH DICTIONARY english_stem ( TEMPLATE = snowball, Language = english, StopWords = english );
The stopword file format is the same as already explained.
A Snowball dictionary recognizes everything, whether or not it is able to simplify the word, so it should be placed at the end of the dictionary list. It is useless to have it before any other dictionary because a token will never pass through it to the next dictionary.
A text search configuration specifies all options necessary to transform a
document into a tsvector
: the parser to use to break text
into tokens, and the dictionaries to use to transform each token into a
lexeme. Every call of
to_tsvector
or to_tsquery
needs a text search configuration to perform its processing.
The configuration parameter
default_text_search_config
specifies the name of the default configuration, which is the
one used by text search functions if an explicit configuration
parameter is omitted.
It can be set in postgresql.conf
, or set for an
individual session using the SET
command.
Several predefined text search configurations are available, and you can create custom configurations easily. To facilitate management of text search objects, a set of SQL commands is available, and there are several psql commands that display information about text search objects (Section 12.10).
As an example we will create a configuration
pg
, starting by duplicating the built-in
english
configuration:
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
We will use a PostgreSQL-specific synonym list
and store it in $SHAREDIR/tsearch_data/pg_dict.syn
.
The file contents look like:
postgres pg pgsql pg postgresql pg
We define the synonym dictionary like this:
CREATE TEXT SEARCH DICTIONARY pg_dict ( TEMPLATE = synonym, SYNONYMS = pg_dict );
Next we register the Ispell dictionary
english_ispell
, which has its own configuration files:
CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, DictFile = english, AffFile = english, StopWords = english );
Now we can set up the mappings for words in configuration
pg
:
ALTER TEXT SEARCH CONFIGURATION pg ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part WITH pg_dict, english_ispell, english_stem;
We choose not to index or search some token types that the built-in configuration does handle:
ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, url_path, sfloat, float;
Now we can test our configuration:
SELECT * FROM ts_debug('public.pg', ' PostgreSQL, the highly scalable, SQL compliant, open source object-relational database management system, is now undergoing beta testing of the next version of our software. ');
The next step is to set the session to use the new configuration, which was
created in the public
schema:
=> \dF List of text search configurations Schema | Name | Description ---------+------+------------- public | pg | SET default_text_search_config = 'public.pg'; SET SHOW default_text_search_config; default_text_search_config ---------------------------- public.pg
The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.
The function ts_debug
allows easy testing of a
text search configuration.
ts_debug([ config regconfig, ] document text, OUT alias text, OUT description text, OUT token text, OUT dictionaries regdictionary[], OUT dictionary regdictionary, OUT lexemes text[]) returns setof record
ts_debug
displays information about every token of
document
as produced by the
parser and processed by the configured dictionaries. It uses the
configuration specified by config
,
or default_text_search_config
if that argument is
omitted.
ts_debug
returns one row for each token identified in the text
by the parser. The columns returned are
alias text — short name of the token type
description text — description of the token type
token text — text of the token
dictionaries regdictionary[] — the dictionaries selected by the configuration for this token type
dictionary regdictionary — the dictionary that recognized the token, or NULL if none did
lexemes text[] — the lexeme(s) produced by the dictionary that recognized the token, or NULL if none did; an empty array ({}) means it was recognized as a stop word
Here is a simple example:
SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-------+----------------+--------------+--------- asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | on | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat} blank | Space symbols | | {} | | blank | Space symbols | - | {} | | asciiword | Word, all ASCII | it | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate} blank | Space symbols | | {} | | asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} blank | Space symbols | | {} | | asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat}
For a more extensive demonstration, we
first create a public.english
configuration and
Ispell dictionary for the English language:
CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english ); CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, DictFile = english, AffFile = english, StopWords = english ); ALTER TEXT SEARCH CONFIGURATION public.english ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+-------------+-------------------------------+----------------+------------- asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright} blank | Space symbols | | {} | | asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova}
In this example, the word Brightest
was recognized by the
parser as an ASCII word
(alias asciiword
).
For this token type the dictionary list is
english_ispell
and
english_stem
. The word was recognized by
english_ispell
, which reduced it to the noun
bright
. The word supernovaes
is
unknown to the english_ispell
dictionary so it
was passed to the next dictionary, and, fortunately, was recognized (in
fact, english_stem
is a Snowball dictionary which
recognizes everything; that is why it was placed at the end of the
dictionary list).
The word The
was recognized by the
english_ispell
dictionary as a stop word (Section 12.6.1) and will not be indexed.
The spaces are discarded too, since the configuration provides no
dictionaries at all for them.
You can reduce the width of the output by explicitly specifying which columns you want to see:
SELECT alias, token, dictionary, lexemes FROM ts_debug('public.english', 'The Brightest supernovaes'); alias | token | dictionary | lexemes -----------+-------------+----------------+------------- asciiword | The | english_ispell | {} blank | | | asciiword | Brightest | english_ispell | {bright} blank | | | asciiword | supernovaes | english_stem | {supernova}
The following functions allow direct testing of a text search parser.
ts_parse(parser_name text, document text, OUT tokid integer, OUT token text) returns setof record
ts_parse(parser_oid oid, document text, OUT tokid integer, OUT token text) returns setof record
ts_parse
parses the given document
and returns a series of records, one for each token produced by
parsing. Each record includes a tokid
showing the
assigned token type and a token
which is the text of the
token. For example:
SELECT * FROM ts_parse('default', '123 - a number'); tokid | token -------+-------- 22 | 123 12 | 12 | - 1 | a 12 | 1 | number
ts_token_type(parser_name text, OUT tokid integer, OUT alias text, OUT description text) returns setof record
ts_token_type(parser_oid oid, OUT tokid integer, OUT alias text, OUT description text) returns setof record
ts_token_type
returns a table which describes each type of
token the specified parser can recognize. For each token type, the table
gives the integer tokid
that the parser uses to label a
token of that type, the alias
that names the token type
in configuration commands, and a short description
. For
example:
SELECT * FROM ts_token_type('default'); tokid | alias | description -------+-----------------+------------------------------------------ 1 | asciiword | Word, all ASCII 2 | word | Word, all letters 3 | numword | Word, letters and digits 4 | email | Email address 5 | url | URL 6 | host | Host 7 | sfloat | Scientific notation 8 | version | Version number 9 | hword_numpart | Hyphenated word part, letters and digits 10 | hword_part | Hyphenated word part, all letters 11 | hword_asciipart | Hyphenated word part, all ASCII 12 | blank | Space symbols 13 | tag | XML tag 14 | protocol | Protocol head 15 | numhword | Hyphenated word, letters and digits 16 | asciihword | Hyphenated word, all ASCII 17 | hword | Hyphenated word, all letters 18 | url_path | URL path 19 | file | File or path name 20 | float | Decimal notation 21 | int | Signed integer 22 | uint | Unsigned integer 23 | entity | XML entity
The ts_lexize
function facilitates dictionary testing.
ts_lexize(dict regdictionary, token text) returns text[]
ts_lexize
returns an array of lexemes if the input
token
is known to the dictionary,
or an empty array if the token
is known to the dictionary but it is a stop word, or
NULL
if it is an unknown word.
Examples:
SELECT ts_lexize('english_stem', 'stars'); ts_lexize ----------- {star} SELECT ts_lexize('english_stem', 'a'); ts_lexize ----------- {}
The ts_lexize
function expects a single
token, not text. Here is a case
where this can be confusing:
SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null; ?column? ---------- t
The thesaurus dictionary thesaurus_astro
does know the
phrase supernovae stars
, but ts_lexize
fails since it does not parse the input text but treats it as a single
token. Use plainto_tsquery
or to_tsvector
to
test thesaurus dictionaries, for example:
SELECT plainto_tsquery('supernovae stars'); plainto_tsquery ----------------- 'sn'
There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
To create such an index, do one of:
CREATE INDEX name ON table USING GIN (column);
Creates a GIN (Generalized Inverted Index)-based index.
The column
must be of tsvector
type.
CREATE INDEX name ON table USING GIST (column [ { DEFAULT | tsvector_ops } (siglen = number) ]);
Creates a GiST (Generalized Search Tree)-based index.
The column
can be of tsvector
or
tsquery
type.
Optional integer parameter siglen
determines
signature length in bytes (see below for details).
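For example, concrete definitions might look like this, reusing the apod table and textsearch column from the ranking examples earlier in this chapter (the index names and the siglen value are arbitrary):

CREATE INDEX apod_textsearch_gin  ON apod USING GIN (textsearch);
CREATE INDEX apod_textsearch_gist ON apod USING GIST (textsearch tsvector_ops (siglen = 256));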
GIN indexes are the preferred text search index type. As inverted
indexes, they contain an index entry for each word (lexeme), with a
compressed list of matching locations. Multi-word searches can find
the first match, then use the index to remove rows that are lacking
additional words. GIN indexes store only the words (lexemes) of
tsvector
values, and not their weight labels. Thus a table
row recheck is needed when using a query that involves weights.
A GiST index is lossy, meaning that the index
might produce false matches, and it is necessary
to check the actual table row to eliminate such false matches.
(PostgreSQL does this automatically when needed.)
GiST indexes are lossy because each document is represented in the
index by a fixed-length signature. The signature length in bytes is determined
by the value of the optional integer parameter siglen
.
The default signature length (when siglen is not specified) is 124 bytes; the maximum signature length is 2024 bytes. The signature is generated by hashing
each word into a single bit in an n-bit string, with all these bits OR-ed
together to produce an n-bit document signature. When two words hash to
the same bit position there will be a false match. If all words in
the query have matches (real or false) then the table row must be
retrieved to see if the match is correct. Longer signatures lead to a more
precise search (scanning a smaller fraction of the index and fewer heap
pages), at the cost of a larger index.
A GiST index can be covering, i.e., use the INCLUDE
clause. Included columns can have data types without any GiST operator
class. Included attributes will be stored uncompressed.
Lossiness causes performance degradation due to unnecessary fetches of table records that turn out to be false matches. Since random access to table records is slow, this limits the usefulness of GiST indexes. The likelihood of false matches depends on several factors, in particular the number of unique words, so using dictionaries to reduce this number is recommended.
Note that GIN index build time can often be improved by increasing maintenance_work_mem, while GiST index build time is not sensitive to that parameter.
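For instance, one might raise the setting for the current session before building a large GIN index and then restore it afterwards (the value shown is arbitrary):

SET maintenance_work_mem = '1GB';
CREATE INDEX name ON table USING GIN (column);
RESET maintenance_work_mem;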
Partitioning of big collections and the proper use of GIN and GiST indexes allows the implementation of very fast searches with online update. Partitioning can be done at the database level using table inheritance, or by distributing documents over servers and collecting external search results, e.g., via Foreign Data access. The latter is possible because ranking functions use only local information.
Information about text search configuration objects can be obtained in psql using a set of commands:
\dF{d,p,t}[+] [PATTERN]
An optional +
produces more details.
The optional parameter PATTERN
can be the name of
a text search object, optionally schema-qualified. If
PATTERN
is omitted then information about all
visible objects will be displayed. PATTERN
can be a
regular expression and can provide separate patterns
for the schema and object names. The following examples illustrate this:
=> \dF *fulltext* List of text search configurations Schema | Name | Description --------+--------------+------------- public | fulltext_cfg |
=> \dF *.fulltext* List of text search configurations Schema | Name | Description ----------+---------------------------- fulltext | fulltext_cfg | public | fulltext_cfg |
The available commands are:
\dF[+] [PATTERN]
List text search configurations (add +
for more detail).
=> \dF russian List of text search configurations Schema | Name | Description ------------+---------+------------------------------------ pg_catalog | russian | configuration for russian language => \dF+ russian Text search configuration "pg_catalog.russian" Parser: "pg_catalog.default" Token | Dictionaries -----------------+-------------- asciihword | english_stem asciiword | english_stem email | simple file | simple float | simple host | simple hword | russian_stem hword_asciipart | english_stem hword_numpart | simple hword_part | russian_stem int | simple numhword | simple numword | simple sfloat | simple uint | simple url | simple url_path | simple version | simple word | russian_stem
\dFd[+] [PATTERN]
List text search dictionaries (add +
for more detail).
=> \dFd List of text search dictionaries Schema | Name | Description ------------+-----------------+----------------------------------------------------------- pg_catalog | arabic_stem | snowball stemmer for arabic language pg_catalog | armenian_stem | snowball stemmer for armenian language pg_catalog | basque_stem | snowball stemmer for basque language pg_catalog | catalan_stem | snowball stemmer for catalan language pg_catalog | danish_stem | snowball stemmer for danish language pg_catalog | dutch_stem | snowball stemmer for dutch language pg_catalog | english_stem | snowball stemmer for english language pg_catalog | finnish_stem | snowball stemmer for finnish language pg_catalog | french_stem | snowball stemmer for french language pg_catalog | german_stem | snowball stemmer for german language pg_catalog | greek_stem | snowball stemmer for greek language pg_catalog | hindi_stem | snowball stemmer for hindi language pg_catalog | hungarian_stem | snowball stemmer for hungarian language pg_catalog | indonesian_stem | snowball stemmer for indonesian language pg_catalog | irish_stem | snowball stemmer for irish language pg_catalog | italian_stem | snowball stemmer for italian language pg_catalog | lithuanian_stem | snowball stemmer for lithuanian language pg_catalog | nepali_stem | snowball stemmer for nepali language pg_catalog | norwegian_stem | snowball stemmer for norwegian language pg_catalog | portuguese_stem | snowball stemmer for portuguese language pg_catalog | romanian_stem | snowball stemmer for romanian language pg_catalog | russian_stem | snowball stemmer for russian language pg_catalog | serbian_stem | snowball stemmer for serbian language pg_catalog | simple | simple dictionary: just lower case and check for stopword pg_catalog | spanish_stem | snowball stemmer for spanish language pg_catalog | swedish_stem | snowball stemmer for swedish language pg_catalog | tamil_stem | snowball stemmer for tamil language pg_catalog | turkish_stem | snowball stemmer for turkish language pg_catalog | yiddish_stem | snowball stemmer for yiddish language
\dFp[+] [PATTERN]
List text search parsers (add +
for more detail).
=> \dFp List of text search parsers Schema | Name | Description ------------+---------+--------------------- pg_catalog | default | default word parser => \dFp+ Text search parser "pg_catalog.default" Method | Function | Description -----------------+----------------+------------- Start parse | prsd_start | Get next token | prsd_nexttoken | End parse | prsd_end | Get headline | prsd_headline | Get token types | prsd_lextype | Token types for parser "pg_catalog.default" Token name | Description -----------------+------------------------------------------ asciihword | Hyphenated word, all ASCII asciiword | Word, all ASCII blank | Space symbols email | Email address entity | XML entity file | File or path name float | Decimal notation host | Host hword | Hyphenated word, all letters hword_asciipart | Hyphenated word part, all ASCII hword_numpart | Hyphenated word part, letters and digits hword_part | Hyphenated word part, all letters int | Signed integer numhword | Hyphenated word, letters and digits numword | Word, letters and digits protocol | Protocol head sfloat | Scientific notation tag | XML tag uint | Unsigned integer url | URL url_path | URL path version | Version number word | Word, all letters (23 rows)
\dFt[+] [PATTERN]
List text search templates (add +
for more detail).
=> \dFt
                           List of text search templates
   Schema   |   Name    |                        Description
------------+-----------+-----------------------------------------------------------
 pg_catalog | ispell    | ispell dictionary
 pg_catalog | simple    | simple dictionary: just lower case and check for stopword
 pg_catalog | snowball  | snowball stemmer
 pg_catalog | synonym   | synonym dictionary: replace word by its synonym
 pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
The current limitations of PostgreSQL's text search features are:
The length of each lexeme must be less than 2 kilobytes
The length of a tsvector (lexemes + positions) must be less than 1 megabyte
The number of lexemes must be less than 2^64
Position values in tsvector must be greater than 0 and no more than 16,383
The match distance in a <N> (FOLLOWED BY) tsquery operator cannot be more than 16,384
No more than 256 positions per lexeme
The number of nodes (lexemes + operators) in a tsquery must be less than 32,768
For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of 335,420 words, and the most frequent word “postgresql” was mentioned 6,127 times in 655 documents.
Another example — the PostgreSQL mailing list archives contained 910,989 unique words with 57,491,343 lexemes in 461,020 messages.
Table of Contents
This chapter describes the behavior of the PostgreSQL database system when two or more sessions try to access the same data at the same time. The goals in that situation are to allow efficient access for all sessions while maintaining strict data integrity. Every developer of database applications should be familiar with the topics covered in this chapter.
PostgreSQL provides a rich set of tools for developers to manage concurrent access to data. Internally, data consistency is maintained by using a multiversion model (Multiversion Concurrency Control, MVCC). This means that each SQL statement sees a snapshot of data (a database version) as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session. MVCC, by eschewing the locking methodologies of traditional database systems, minimizes lock contention in order to allow for reasonable performance in multiuser environments.
The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading. PostgreSQL maintains this guarantee even when providing the strictest level of transaction isolation through the use of an innovative Serializable Snapshot Isolation (SSI) level.
Table- and row-level locking facilities are also available in PostgreSQL for applications which don't generally need full transaction isolation and prefer to explicitly manage particular points of conflict. However, proper use of MVCC will generally provide better performance than locks. In addition, application-defined advisory locks provide a mechanism for acquiring locks that are not tied to a single transaction.
The SQL standard defines four levels of transaction isolation. The most strict is Serializable, which is defined by the standard in a paragraph which says that any concurrent execution of a set of Serializable transactions is guaranteed to produce the same effect as running them one at a time in some order. The other three levels are defined in terms of phenomena, resulting from interaction between concurrent transactions, which must not occur at each level. The standard notes that due to the definition of Serializable, none of these phenomena are possible at that level. (This is hardly surprising -- if the effect of the transactions must be consistent with having been run one at a time, how could you see any phenomena caused by interactions?)
The phenomena which are prohibited at various levels are:
A transaction reads data written by a concurrent uncommitted transaction.
A transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).
A transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.
The result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.
The SQL standard and PostgreSQL-implemented transaction isolation levels are described in Table 13.1.
Table 13.1. Transaction Isolation Levels
Isolation Level | Dirty Read | Nonrepeatable Read | Phantom Read | Serialization Anomaly |
---|---|---|---|---|
Read uncommitted | Allowed, but not in PG | Possible | Possible | Possible |
Read committed | Not possible | Possible | Possible | Possible |
Repeatable read | Not possible | Not possible | Allowed, but not in PG | Possible |
Serializable | Not possible | Not possible | Not possible | Not possible |
In PostgreSQL, you can request any of the four standard transaction isolation levels, but internally only three distinct isolation levels are implemented, i.e., PostgreSQL's Read Uncommitted mode behaves like Read Committed. This is because it is the only sensible way to map the standard isolation levels to PostgreSQL's multiversion concurrency control architecture.
The table also shows that PostgreSQL's Repeatable Read implementation does not allow phantom reads. This is acceptable under the SQL standard because the standard specifies which anomalies must not occur at certain isolation levels; higher guarantees are acceptable. The behavior of the available isolation levels is detailed in the following subsections.
To set the transaction isolation level of a transaction, use the command SET TRANSACTION.
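For example, a minimal sketch; the level can also be given directly in BEGIN, or defaulted for a session via default_transaction_isolation:

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- statements in this transaction now run at Repeatable Read
COMMIT;

-- equivalent shorthand:
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- ...
COMMIT;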
Some PostgreSQL data types and functions have
special rules regarding transactional behavior. In particular, changes
made to a sequence (and therefore the counter of a
column declared using serial
) are immediately visible
to all other transactions and are not rolled back if the transaction
that made the changes aborts. See Section 9.17
and Section 8.1.4.
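For illustration, a short sketch assuming a sequence named my_seq exists (the name is purely illustrative); the value consumed inside the aborted transaction is not given back:

BEGIN;
SELECT nextval('my_seq');   -- advances the sequence immediately
ROLLBACK;
SELECT nextval('my_seq');   -- the rollback did not undo the advance; a later value is returned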
Read Committed is the default isolation
level in PostgreSQL. When a transaction
uses this isolation level, a SELECT
query
(without a FOR UPDATE/SHARE
clause) sees only data
committed before the query began; it never sees either uncommitted
data or changes committed during query execution by concurrent
transactions. In effect, a SELECT
query sees
a snapshot of the database as of the instant the query begins to
run. However, SELECT
does see the effects
of previous updates executed within its own transaction, even
though they are not yet committed. Also note that two successive
SELECT
commands can see different data, even
though they are within a single transaction, if other transactions
commit changes after the first SELECT
starts and
before the second SELECT
starts.
UPDATE
, DELETE
, SELECT
FOR UPDATE
, and SELECT FOR SHARE
commands
behave the same as SELECT
in terms of searching for target rows: they will only find target rows
that were committed as of the command start time. However, such a target
row might have already been updated (or deleted or locked) by
another concurrent transaction by the time it is found. In this case, the
would-be updater will wait for the first updating transaction to commit or
roll back (if it is still in progress). If the first updater rolls back,
then its effects are negated and the second updater can proceed with
updating the originally found row. If the first updater commits, the
second updater will ignore the row if the first updater deleted it,
otherwise it will attempt to apply its operation to the updated version of
the row. The search condition of the command (the WHERE
clause) is
re-evaluated to see if the updated version of the row still matches the
search condition. If so, the second updater proceeds with its operation
using the updated version of the row. In the case of
SELECT FOR UPDATE
and SELECT FOR
SHARE
, this means it is the updated version of the row that is
locked and returned to the client.
INSERT
with an ON CONFLICT DO UPDATE
clause
behaves similarly. In Read Committed mode, each row proposed for insertion
will either insert or update. Unless there are unrelated errors, one of
those two outcomes is guaranteed. If a conflict originates in another
transaction whose effects are not yet visible to the INSERT
, the UPDATE
clause will affect that row,
even though possibly no version of that row is
conventionally visible to the command.
INSERT
with an ON CONFLICT DO
NOTHING
clause may have insertion not proceed for a row due to
the outcome of another transaction whose effects are not visible
to the INSERT
snapshot. Again, this is only
the case in Read Committed mode.
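As a sketch of both clauses, assuming a hypothetical table counters(name text PRIMARY KEY, hits int):

INSERT INTO counters (name, hits) VALUES ('home', 1)
ON CONFLICT (name) DO UPDATE SET hits = counters.hits + 1;   -- insert the row, or update the conflicting one

INSERT INTO counters (name, hits) VALUES ('home', 1)
ON CONFLICT (name) DO NOTHING;                               -- skip the row if a conflicting one exists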
Because of the above rules, it is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands on the same rows it is trying to update, but it does not see effects of those commands on other rows in the database. This behavior makes Read Committed mode unsuitable for commands that involve complex search conditions; however, it is just right for simpler cases. For example, consider updating bank balances with transactions like:
BEGIN;
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 12345;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 7534;
COMMIT;
If two such transactions concurrently try to change the balance of account 12345, we clearly want the second transaction to start with the updated version of the account's row. Because each command is affecting only a predetermined row, letting it see the updated version of the row does not create any troublesome inconsistency.
More complex usage can produce undesirable results in Read Committed
mode. For example, consider a DELETE
command
operating on data that is being both added and removed from its
restriction criteria by another command, e.g., assume
website
is a two-row table with
website.hits
equaling 9
and
10
:
BEGIN;
UPDATE website SET hits = hits + 1;
-- run from another session:  DELETE FROM website WHERE hits = 10;
COMMIT;
The DELETE
will have no effect even though
there is a website.hits = 10
row before and
after the UPDATE
. This occurs because the
pre-update row value 9
is skipped, and when the
UPDATE
completes and DELETE
obtains a lock, the new row value is no longer 10
but
11
, which no longer matches the criteria.
Because Read Committed mode starts each command with a new snapshot that includes all transactions committed up to that instant, subsequent commands in the same transaction will see the effects of the committed concurrent transaction in any case. The point at issue above is whether or not a single command sees an absolutely consistent view of the database.
The partial transaction isolation provided by Read Committed mode is adequate for many applications, and this mode is fast and simple to use; however, it is not sufficient for all cases. Applications that do complex queries and updates might require a more rigorously consistent view of the database than Read Committed mode provides.
The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. (However, the query does see the effects of previous updates executed within its own transaction, even though they are not yet committed.) This is a stronger guarantee than is required by the SQL standard for this isolation level, and prevents all of the phenomena described in Table 13.1 except for serialization anomalies. As mentioned above, this is specifically allowed by the standard, which only describes the minimum protections each isolation level must provide.
This level is different from Read Committed in that a query in a
repeatable read transaction sees a snapshot as of the start of the
first non-transaction-control statement in the
transaction, not as of the start
of the current statement within the transaction. Thus, successive
SELECT
commands within a single
transaction see the same data, i.e., they do not see changes made by
other transactions that committed after their own transaction started.
Applications using this level must be prepared to retry transactions due to serialization failures.
UPDATE
, DELETE
, SELECT
FOR UPDATE
, and SELECT FOR SHARE
commands
behave the same as SELECT
in terms of searching for target rows: they will only find target rows
that were committed as of the transaction start time. However, such a
target row might have already been updated (or deleted or locked) by
another concurrent transaction by the time it is found. In this case, the
repeatable read transaction will wait for the first updating transaction to commit or
roll back (if it is still in progress). If the first updater rolls back,
then its effects are negated and the repeatable read transaction can proceed
with updating the originally found row. But if the first updater commits
(and actually updated or deleted the row, not just locked it)
then the repeatable read transaction will be rolled back with the message
ERROR: could not serialize access due to concurrent update
because a repeatable read transaction cannot modify or lock rows changed by other transactions after the repeatable read transaction began.
When an application receives this error message, it should abort the current transaction and retry the whole transaction from the beginning. The second time through, the transaction will see the previously-committed change as part of its initial view of the database, so there is no logical conflict in using the new version of the row as the starting point for the new transaction's update.
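A sketch of how such a failure can arise, using the accounts table from the earlier examples; session 1 must then retry from BEGIN:

-- session 1:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE acctnum = 12345;   -- the snapshot is taken here

-- session 2 (autocommit):
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 12345;

-- session 1:
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 12345;
-- ERROR:  could not serialize access due to concurrent update
ROLLBACK;   -- the application should now retry the whole transaction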
Note that only updating transactions might need to be retried; read-only transactions will never have serialization conflicts.
The Repeatable Read mode provides a rigorous guarantee that each transaction sees a completely stable view of the database. However, this view will not necessarily always be consistent with some serial (one at a time) execution of concurrent transactions of the same level. For example, even a read-only transaction at this level may see a control record updated to show that a batch has been completed but not see one of the detail records which is logically part of the batch because it read an earlier revision of the control record. Attempts to enforce business rules by transactions running at this isolation level are not likely to work correctly without careful use of explicit locks to block conflicting transactions.
The Repeatable Read isolation level is implemented using a technique known in academic database literature and in some other database products as Snapshot Isolation. Differences in behavior and performance may be observed when compared with systems that use a traditional locking technique that reduces concurrency. Some other systems may even offer Repeatable Read and Snapshot Isolation as distinct isolation levels with different behavior. The permitted phenomena that distinguish the two techniques were not formalized by database researchers until after the SQL standard was developed, and are outside the scope of this manual. For a full treatment, please see [berenson95].
Prior to PostgreSQL version 9.1, a request for the Serializable transaction isolation level provided exactly the same behavior described here. To retain the legacy Serializable behavior, Repeatable Read should now be requested.
The Serializable isolation level provides the strictest transaction isolation. This level emulates serial transaction execution for all committed transactions; as if transactions had been executed one after another, serially, rather than concurrently. However, like the Repeatable Read level, applications using this level must be prepared to retry transactions due to serialization failures. In fact, this isolation level works exactly the same as Repeatable Read except that it monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions. This monitoring does not introduce any blocking beyond that present in repeatable read, but there is some overhead to the monitoring, and detection of the conditions which could cause a serialization anomaly will trigger a serialization failure.
As an example,
consider a table mytab
, initially containing:
 class | value
-------+-------
     1 |    10
     1 |    20
     2 |   100
     2 |   200
Suppose that serializable transaction A computes:
SELECT SUM(value) FROM mytab WHERE class = 1;
and then inserts the result (30) as the value
in a
new row with class
= 2
. Concurrently, serializable
transaction B computes:
SELECT SUM(value) FROM mytab WHERE class = 2;
and obtains the result 300, which it inserts in a new row with
class
= 1
. Then both transactions try to commit.
If either transaction were running at the Repeatable Read isolation level,
both would be allowed to commit; but since there is no serial order of execution
consistent with the result, using Serializable transactions will allow one
transaction to commit and will roll the other back with this message:
ERROR: could not serialize access due to read/write dependencies among transactions
This is because if A had executed before B, B would have computed the sum 330, not 300, and similarly the other order would have resulted in a different sum computed by A.
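A sketch of the two sessions; the exact point at which the error is raised (at the INSERT or at COMMIT) can vary:

-- transaction A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(value) FROM mytab WHERE class = 1;   -- returns 30

-- transaction B:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT SUM(value) FROM mytab WHERE class = 2;   -- returns 300

-- transaction A:
INSERT INTO mytab VALUES (2, 30);
COMMIT;

-- transaction B:
INSERT INTO mytab VALUES (1, 300);
COMMIT;
-- one of the two transactions fails with SQLSTATE 40001:
-- ERROR:  could not serialize access due to read/write dependencies among transactions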
When relying on Serializable transactions to prevent anomalies, it is important that any data read from a permanent user table not be considered valid until the transaction which read it has successfully committed. This is true even for read-only transactions, except that data read within a deferrable read-only transaction is known to be valid as soon as it is read, because such a transaction waits until it can acquire a snapshot guaranteed to be free from such problems before starting to read any data. In all other cases applications must not depend on results read during a transaction that later aborted; instead, they should retry the transaction until it succeeds.
To guarantee true serializability PostgreSQL
uses predicate locking, which means that it keeps locks
which allow it to determine when a write would have had an impact on
the result of a previous read from a concurrent transaction, had it run
first. In PostgreSQL these locks do not
cause any blocking and therefore can not play any part in
causing a deadlock. They are used to identify and flag dependencies
among concurrent Serializable transactions which in certain combinations
can lead to serialization anomalies. In contrast, a Read Committed or
Repeatable Read transaction which wants to ensure data consistency may
need to take out a lock on an entire table, which could block other
users attempting to use that table, or it may use SELECT FOR
UPDATE
or SELECT FOR SHARE
which not only
can block other transactions but cause disk access.
Predicate locks in PostgreSQL, like in most
other database systems, are based on data actually accessed by a
transaction. These will show up in the
pg_locks
system view with a mode
of SIReadLock
. The
particular locks
acquired during execution of a query will depend on the plan used by
the query, and multiple finer-grained locks (e.g., tuple locks) may be
combined into fewer coarser-grained locks (e.g., page locks) during the
course of the transaction to prevent exhaustion of the memory used to
track the locks. A READ ONLY
transaction may be able to
release its SIRead locks before completion, if it detects that no
conflicts can still occur which could lead to a serialization anomaly.
In fact, READ ONLY
transactions will often be able to
establish that fact at startup and avoid taking any predicate locks.
If you explicitly request a SERIALIZABLE READ ONLY DEFERRABLE
transaction, it will block until it can establish this fact. (This is
the only case where Serializable transactions block but
Repeatable Read transactions don't.) On the other hand, SIRead locks
often need to be kept past transaction commit, until overlapping read
write transactions complete.
Consistent use of Serializable transactions can simplify development.
The guarantee that any set of successfully committed concurrent
Serializable transactions will have the same effect as if they were run
one at a time means that if you can demonstrate that a single transaction,
as written, will do the right thing when run by itself, you can have
confidence that it will do the right thing in any mix of Serializable
transactions, even without any information about what those other
transactions might do, or it will not successfully commit. It is
important that an environment which uses this technique have a
generalized way of handling serialization failures (which always return
with an SQLSTATE value of '40001'), because it will be very hard to
predict exactly which transactions might contribute to the read/write
dependencies and need to be rolled back to prevent serialization
anomalies. The monitoring of read/write dependencies has a cost, as does
the restart of transactions which are terminated with a serialization
failure, but balanced against the cost and blocking involved in use of
explicit locks and SELECT FOR UPDATE
or SELECT FOR
SHARE
, Serializable transactions are the best performance choice
for some environments.
While PostgreSQL's Serializable transaction isolation level only allows concurrent transactions to commit if it can prove there is a serial order of execution that would produce the same effect, it doesn't always prevent errors from being raised that would not occur in true serial execution. In particular, it is possible to see unique constraint violations caused by conflicts with overlapping Serializable transactions even after explicitly checking that the key isn't present before attempting to insert it. This can be avoided by making sure that all Serializable transactions that insert potentially conflicting keys explicitly check if they can do so first. For example, imagine an application that asks the user for a new key and then checks that it doesn't exist already by trying to select it first, or generates a new key by selecting the maximum existing key and adding one. If some Serializable transactions insert new keys directly without following this protocol, unique constraints violations might be reported even in cases where they could not occur in a serial execution of the concurrent transactions.
For optimal performance when relying on Serializable transactions for concurrency control, these issues should be considered:
Declare transactions as READ ONLY
when possible.
Control the number of active connections, using a connection pool if needed. This is always an important performance consideration, but it can be particularly important in a busy system using Serializable transactions.
Don't put more into a single transaction than needed for integrity purposes.
Don't leave connections dangling “idle in transaction” longer than necessary. The configuration parameter idle_in_transaction_session_timeout may be used to automatically disconnect lingering sessions.
Eliminate explicit locks, SELECT FOR UPDATE
, and
SELECT FOR SHARE
where no longer needed due to the
protections automatically provided by Serializable transactions.
When the system is forced to combine multiple page-level predicate locks into a single relation-level predicate lock because the predicate lock table is short of memory, an increase in the rate of serialization failures may occur. You can avoid this by increasing max_pred_locks_per_transaction, max_pred_locks_per_relation, and/or max_pred_locks_per_page.
A sequential scan will always necessitate a relation-level predicate lock. This can result in an increased rate of serialization failures. It may be helpful to encourage the use of index scans by reducing random_page_cost and/or increasing cpu_tuple_cost. Be sure to weigh any decrease in transaction rollbacks and restarts against any overall change in query execution time.
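The parameters mentioned in the last two points can be adjusted in postgresql.conf (or with ALTER SYSTEM); the values below are purely illustrative, not recommendations:

# predicate-lock capacity (defaults: 64, -2, 2)
max_pred_locks_per_transaction = 128
max_pred_locks_per_relation = -2
max_pred_locks_per_page = 4

# planner costs that influence the choice of index scans over sequential scans
random_page_cost = 1.1
cpu_tuple_cost = 0.03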
The Serializable isolation level is implemented using a technique known in academic database literature as Serializable Snapshot Isolation, which builds on Snapshot Isolation by adding checks for serialization anomalies. Some differences in behavior and performance may be observed when compared with other systems that use a traditional locking technique. Please see [ports12] for detailed information.
PostgreSQL provides various lock modes
to control concurrent access to data in tables. These modes can
be used for application-controlled locking in situations where
MVCC does not give the desired behavior. Also,
most PostgreSQL commands automatically
acquire locks of appropriate modes to ensure that referenced
tables are not dropped or modified in incompatible ways while the
command executes. (For example, TRUNCATE
cannot safely be
executed concurrently with other operations on the same table, so it
obtains an ACCESS EXCLUSIVE
lock on the table to
enforce that.)
To examine a list of the currently outstanding locks in a database
server, use the
pg_locks
system view. For more information on monitoring the status of the lock
manager subsystem, refer to Chapter 28.
The list below shows the available lock modes and the contexts in
which they are used automatically by
PostgreSQL. You can also acquire any
of these locks explicitly with the command LOCK.
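For example, a minimal sketch of taking an explicit table-level lock, reusing the accounts table from the earlier examples; the maintenance action shown is hypothetical:

BEGIN;
LOCK TABLE accounts IN SHARE ROW EXCLUSIVE MODE;
-- no other session can modify accounts until this transaction ends
UPDATE accounts SET balance = 0 WHERE balance < 0;
COMMIT;   -- the lock is released here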
Remember that all of these lock modes are table-level locks,
even if the name contains the word
“row”; the names of the lock modes are historical.
To some extent the names reflect the typical usage of each lock
mode — but the semantics are all the same. The only real difference
between one lock mode and another is the set of lock modes with
which each conflicts (see Table 13.2).
Two transactions cannot hold locks of conflicting
modes on the same table at the same time. (However, a transaction
never conflicts with itself. For example, it might acquire
ACCESS EXCLUSIVE
lock and later acquire
ACCESS SHARE
lock on the same table.) Non-conflicting
lock modes can be held concurrently by many transactions. Notice in
particular that some lock modes are self-conflicting (for example,
an ACCESS EXCLUSIVE
lock cannot be held by more than one
transaction at a time) while others are not self-conflicting (for example,
an ACCESS SHARE
lock can be held by multiple transactions).
Table-Level Lock Modes
ACCESS SHARE
(AccessShareLock
)
Conflicts with the ACCESS EXCLUSIVE
lock
mode only.
The SELECT
command acquires a lock of this mode on
referenced tables. In general, any query that only reads a table
and does not modify it will acquire this lock mode.
ROW SHARE
(RowShareLock
)
Conflicts with the EXCLUSIVE
and
ACCESS EXCLUSIVE
lock modes.
The SELECT FOR UPDATE
and
SELECT FOR SHARE
commands acquire a
lock of this mode on the target table(s) (in addition to
ACCESS SHARE
locks on any other tables
that are referenced but not selected
FOR UPDATE/FOR SHARE
).
ROW EXCLUSIVE
(RowExclusiveLock
)
Conflicts with the SHARE
, SHARE ROW
EXCLUSIVE
, EXCLUSIVE
, and
ACCESS EXCLUSIVE
lock modes.
The commands UPDATE
,
DELETE
, and INSERT
acquire this lock mode on the target table (in addition to
ACCESS SHARE
locks on any other referenced
tables). In general, this lock mode will be acquired by any
command that modifies data in a table.
SHARE UPDATE EXCLUSIVE
(ShareUpdateExclusiveLock
)
Conflicts with the SHARE UPDATE EXCLUSIVE
,
SHARE
, SHARE ROW
EXCLUSIVE
, EXCLUSIVE
, and
ACCESS EXCLUSIVE
lock modes.
This mode protects a table against
concurrent schema changes and VACUUM
runs.
Acquired by VACUUM
(without FULL
),
ANALYZE
, CREATE INDEX CONCURRENTLY
,
CREATE STATISTICS
, COMMENT ON
,
REINDEX CONCURRENTLY
,
and certain ALTER INDEX
and ALTER TABLE
variants
(for full details see the documentation of these commands).
SHARE
(ShareLock
)
Conflicts with the ROW EXCLUSIVE
,
SHARE UPDATE EXCLUSIVE
, SHARE ROW
EXCLUSIVE
, EXCLUSIVE
, and
ACCESS EXCLUSIVE
lock modes.
This mode protects a table against concurrent data changes.
Acquired by CREATE INDEX
(without CONCURRENTLY
).
SHARE ROW EXCLUSIVE
(ShareRowExclusiveLock
)
Conflicts with the ROW EXCLUSIVE
,
SHARE UPDATE EXCLUSIVE
,
SHARE
, SHARE ROW
EXCLUSIVE
, EXCLUSIVE
, and
ACCESS EXCLUSIVE
lock modes.
This mode protects a table against concurrent data changes, and
is self-exclusive so that only one session can hold it at a time.
Acquired by CREATE TRIGGER
and some forms of
ALTER TABLE
.
EXCLUSIVE
(ExclusiveLock
)
Conflicts with the ROW SHARE
, ROW
EXCLUSIVE
, SHARE UPDATE
EXCLUSIVE
, SHARE
, SHARE
ROW EXCLUSIVE
, EXCLUSIVE
, and
ACCESS EXCLUSIVE
lock modes.
This mode allows only concurrent ACCESS SHARE
locks,
i.e., only reads from the table can proceed in parallel with a
transaction holding this lock mode.
Acquired by REFRESH MATERIALIZED VIEW CONCURRENTLY
.
ACCESS EXCLUSIVE
(AccessExclusiveLock
)
Conflicts with locks of all modes (ACCESS
SHARE
, ROW SHARE
, ROW
EXCLUSIVE
, SHARE UPDATE
EXCLUSIVE
, SHARE
, SHARE
ROW EXCLUSIVE
, EXCLUSIVE
, and
ACCESS EXCLUSIVE
).
This mode guarantees that the
holder is the only transaction accessing the table in any way.
Acquired by the DROP TABLE
,
TRUNCATE
, REINDEX
,
CLUSTER
, VACUUM FULL
,
and REFRESH MATERIALIZED VIEW
(without
CONCURRENTLY
)
commands. Many forms of ALTER INDEX
and ALTER TABLE
also acquire
a lock at this level. This is also the default lock mode for
LOCK TABLE
statements that do not specify
a mode explicitly.
Only an ACCESS EXCLUSIVE
lock blocks a
SELECT
(without FOR UPDATE/SHARE
)
statement.
Once acquired, a lock is normally held until the end of the transaction. But if a
lock is acquired after establishing a savepoint, the lock is released
immediately if the savepoint is rolled back to. This is consistent with
the principle that ROLLBACK
cancels all effects of the
commands since the savepoint. The same holds for locks acquired within a
PL/pgSQL exception block: an error escape from the block
releases locks acquired within it.
Table 13.2. Conflicting Lock Modes
Requested Lock Mode | Existing Lock Mode | |||||||
---|---|---|---|---|---|---|---|---|
ACCESS SHARE | ROW SHARE | ROW EXCL. | SHARE UPDATE EXCL. | SHARE | SHARE ROW EXCL. | EXCL. | ACCESS EXCL. | |
ACCESS SHARE | X | |||||||
ROW SHARE | X | X | ||||||
ROW EXCL. | X | X | X | X | ||||
SHARE UPDATE EXCL. | X | X | X | X | X | |||
SHARE | X | X | X | X | X | |||
SHARE ROW EXCL. | X | X | X | X | X | X | ||
EXCL. | X | X | X | X | X | X | X | |
ACCESS EXCL. | X | X | X | X | X | X | X | X |
In addition to table-level locks, there are row-level locks, which are listed below with the contexts in which they are used automatically by PostgreSQL. See Table 13.3 for a complete table of row-level lock conflicts. Note that a transaction can hold conflicting locks on the same row, even in different subtransactions; but other than that, two transactions can never hold conflicting locks on the same row. Row-level locks do not affect data querying; they block only writers and lockers to the same row. Row-level locks are released at transaction end or during savepoint rollback, just like table-level locks.
Row-Level Lock Modes
FOR UPDATE
FOR UPDATE
causes the rows retrieved by the
SELECT
statement to be locked as though for
update. This prevents them from being locked, modified or deleted by
other transactions until the current transaction ends. That is,
other transactions that attempt UPDATE
,
DELETE
,
SELECT FOR UPDATE
,
SELECT FOR NO KEY UPDATE
,
SELECT FOR SHARE
or
SELECT FOR KEY SHARE
of these rows will be blocked until the current transaction ends;
conversely, SELECT FOR UPDATE
will wait for a
concurrent transaction that has run any of those commands on the
same row,
and will then lock and return the updated row (or no row, if the
row was deleted). Within a REPEATABLE READ
or
SERIALIZABLE
transaction,
however, an error will be thrown if a row to be locked has changed
since the transaction started. For further discussion see
Section 13.4.
The FOR UPDATE
lock mode
is also acquired by any DELETE
on a row, and also by an
UPDATE
that modifies the values of certain columns. Currently,
the set of columns considered for the UPDATE
case are those that
have a unique index on them that can be used in a foreign key (so partial
indexes and expressional indexes are not considered), but this may change
in the future.
FOR NO KEY UPDATE
Behaves similarly to FOR UPDATE
, except that the lock
acquired is weaker: this lock will not block
SELECT FOR KEY SHARE
commands that attempt to acquire
a lock on the same rows. This lock mode is also acquired by any
UPDATE
that does not acquire a FOR UPDATE
lock.
FOR SHARE
Behaves similarly to FOR NO KEY UPDATE
, except that it
acquires a shared lock rather than exclusive lock on each retrieved
row. A shared lock blocks other transactions from performing
UPDATE
, DELETE
,
SELECT FOR UPDATE
or
SELECT FOR NO KEY UPDATE
on these rows, but it does not
prevent them from performing SELECT FOR SHARE
or
SELECT FOR KEY SHARE
.
FOR KEY SHARE
Behaves similarly to FOR SHARE
, except that the
lock is weaker: SELECT FOR UPDATE
is blocked, but not
SELECT FOR NO KEY UPDATE
. A key-shared lock blocks
other transactions from performing DELETE
or
any UPDATE
that changes the key values, but not
other UPDATE
, and neither does it prevent
SELECT FOR NO KEY UPDATE
, SELECT FOR SHARE
,
or SELECT FOR KEY SHARE
.
PostgreSQL doesn't remember any
information about modified rows in memory, so there is no limit on
the number of rows locked at one time. However, locking a row
might cause a disk write, e.g., SELECT FOR
UPDATE
modifies selected rows to mark them locked, and so
will result in disk writes.
Table 13.3. Conflicting Row-Level Locks
Requested Lock Mode | Current Lock Mode | |||
---|---|---|---|---|
FOR KEY SHARE | FOR SHARE | FOR NO KEY UPDATE | FOR UPDATE | |
FOR KEY SHARE | X | |||
FOR SHARE | X | X | ||
FOR NO KEY UPDATE | X | X | X | |
FOR UPDATE | X | X | X | X |
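As a short sketch of row-level locking, again using the accounts table from the earlier examples:

BEGIN;
SELECT balance FROM accounts WHERE acctnum = 12345 FOR UPDATE;
-- the matching row is now locked; concurrent UPDATE, DELETE, or SELECT FOR UPDATE
-- on that row will block until this transaction ends
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 12345;
COMMIT;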
In addition to table and row locks, page-level share/exclusive locks are used to control read/write access to table pages in the shared buffer pool. These locks are released immediately after a row is fetched or updated. Application developers normally need not be concerned with page-level locks, but they are mentioned here for completeness.
The use of explicit locking can increase the likelihood of deadlocks, wherein two (or more) transactions each hold locks that the other wants. For example, if transaction 1 acquires an exclusive lock on table A and then tries to acquire an exclusive lock on table B, while transaction 2 has already exclusive-locked table B and now wants an exclusive lock on table A, then neither one can proceed. PostgreSQL automatically detects deadlock situations and resolves them by aborting one of the transactions involved, allowing the other(s) to complete. (Exactly which transaction will be aborted is difficult to predict and should not be relied upon.)
Note that deadlocks can also occur as the result of row-level locks (and thus, they can occur even if explicit locking is not used). Consider the case in which two concurrent transactions modify a table. The first transaction executes:
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 11111;
This acquires a row-level lock on the row with the specified account number. Then, the second transaction executes:
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 11111;
The first UPDATE
statement successfully
acquires a row-level lock on the specified row, so it succeeds in
updating that row. However, the second UPDATE
statement finds that the row it is attempting to update has
already been locked, so it waits for the transaction that
acquired the lock to complete. Transaction two is now waiting on
transaction one to complete before it continues execution. Now,
transaction one executes:
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222;
Transaction one attempts to acquire a row-level lock on the specified row, but it cannot: transaction two already holds such a lock. So it waits for transaction two to complete. Thus, transaction one is blocked on transaction two, and transaction two is blocked on transaction one: a deadlock condition. PostgreSQL will detect this situation and abort one of the transactions.
The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order. In the example above, if both transactions had updated the rows in the same order, no deadlock would have occurred. One should also ensure that the first lock acquired on an object in a transaction is the most restrictive mode that will be needed for that object. If it is not feasible to verify this in advance, then deadlocks can be handled on-the-fly by retrying transactions that abort due to deadlocks.
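For instance, restating the earlier transfer so that both sessions touch the rows in ascending acctnum order avoids the deadlock; this is a sketch of one possible convention, not the only one:

BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 11111;   -- lower key first
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;   -- then the higher key
COMMIT;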
So long as no deadlock situation is detected, a transaction seeking either a table-level or row-level lock will wait indefinitely for conflicting locks to be released. This means it is a bad idea for applications to hold transactions open for long periods of time (e.g., while waiting for user input).
PostgreSQL provides a means for creating locks that have application-defined meanings. These are called advisory locks, because the system does not enforce their use — it is up to the application to use them correctly. Advisory locks can be useful for locking strategies that are an awkward fit for the MVCC model. For example, a common use of advisory locks is to emulate pessimistic locking strategies typical of so-called “flat file” data management systems. While a flag stored in a table could be used for the same purpose, advisory locks are faster, avoid table bloat, and are automatically cleaned up by the server at the end of the session.
There are two ways to acquire an advisory lock in PostgreSQL: at session level or at transaction level. Once acquired at session level, an advisory lock is held until explicitly released or the session ends. Unlike standard lock requests, session-level advisory lock requests do not honor transaction semantics: a lock acquired during a transaction that is later rolled back will still be held following the rollback, and likewise an unlock is effective even if the calling transaction fails later. A lock can be acquired multiple times by its owning process; for each completed lock request there must be a corresponding unlock request before the lock is actually released. Transaction-level lock requests, on the other hand, behave more like regular lock requests: they are automatically released at the end of the transaction, and there is no explicit unlock operation. This behavior is often more convenient than the session-level behavior for short-term usage of an advisory lock. Session-level and transaction-level lock requests for the same advisory lock identifier will block each other in the expected way. If a session already holds a given advisory lock, additional requests by it will always succeed, even if other sessions are awaiting the lock; this statement is true regardless of whether the existing lock hold and new request are at session level or transaction level.
Like all locks in
PostgreSQL, a complete list of advisory locks
currently held by any session can be found in the pg_locks
system
view.
Both advisory locks and regular locks are stored in a shared memory pool whose size is defined by the configuration variables max_locks_per_transaction and max_connections. Care must be taken not to exhaust this memory or the server will be unable to grant any locks at all. This imposes an upper limit on the number of advisory locks grantable by the server, typically in the tens to hundreds of thousands depending on how the server is configured.
In certain cases using advisory locking methods, especially in queries
involving explicit ordering and LIMIT
clauses, care must be
taken to control the locks acquired because of the order in which SQL
expressions are evaluated. For example:
SELECT pg_advisory_lock(id) FROM foo WHERE id = 12345;            -- ok
SELECT pg_advisory_lock(id) FROM foo WHERE id > 12345 LIMIT 100;  -- danger!
SELECT pg_advisory_lock(q.id) FROM
(
  SELECT id FROM foo WHERE id > 12345 LIMIT 100
) q;                                                              -- ok
In the above queries, the second form is dangerous because the
LIMIT
is not guaranteed to be applied before the locking
function is executed. This might cause some locks to be acquired
that the application was not expecting, and hence would fail to release
(until it ends the session).
From the point of view of the application, such locks
would be dangling, although still viewable in
pg_locks
.
The functions provided to manipulate advisory locks are described in Section 9.27.10.
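A brief sketch of the two acquisition styles; the key 42 is arbitrary:

-- session-level: held until explicitly released or the session ends
SELECT pg_advisory_lock(42);
SELECT pg_try_advisory_lock(42);    -- non-blocking variant, returns true/false
SELECT pg_advisory_unlock(42);      -- one unlock per successful lock request

-- transaction-level: released automatically at COMMIT or ROLLBACK
BEGIN;
SELECT pg_advisory_xact_lock(42);
COMMIT;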
It is very difficult to enforce business rules regarding data integrity using Read Committed transactions because the view of the data is shifting with each statement, and even a single statement may not restrict itself to the statement's snapshot if a write conflict occurs.
While a Repeatable Read transaction has a stable view of the data throughout its execution, there is a subtle issue with using MVCC snapshots for data consistency checks, involving something known as read/write conflicts. If one transaction writes data and a concurrent transaction attempts to read the same data (whether before or after the write), it cannot see the work of the other transaction. The reader then appears to have executed first regardless of which started first or which committed first. If that is as far as it goes, there is no problem, but if the reader also writes data which is read by a concurrent transaction there is now a transaction which appears to have run before either of the previously mentioned transactions. If the transaction which appears to have executed last actually commits first, it is very easy for a cycle to appear in a graph of the order of execution of the transactions. When such a cycle appears, integrity checks will not work correctly without some help.
As mentioned in Section 13.2.3, Serializable transactions are just Repeatable Read transactions which add nonblocking monitoring for dangerous patterns of read/write conflicts. When a pattern is detected which could cause a cycle in the apparent order of execution, one of the transactions involved is rolled back to break the cycle.
If the Serializable transaction isolation level is used for all writes and for all reads which need a consistent view of the data, no other effort is required to ensure consistency. Software from other environments which is written to use serializable transactions to ensure consistency should “just work” in this regard in PostgreSQL.
When using this technique, it will avoid creating an unnecessary burden
for application programmers if the application software goes through a
framework which automatically retries transactions which are rolled
back with a serialization failure. It may be a good idea to set
default_transaction_isolation
to serializable
.
It would also be wise to take some action to ensure that no other
transaction isolation level is used, either inadvertently or to
subvert integrity checks, through checks of the transaction isolation
level in triggers.
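A sketch of both settings; the trigger-based enforcement is only hinted at, since its details depend on the application:

-- postgresql.conf (or ALTER SYSTEM SET ...):
--   default_transaction_isolation = 'serializable'

-- per session or per role:
SET default_transaction_isolation = 'serializable';

-- a trigger (or any SQL) can verify the level actually in effect:
SELECT current_setting('transaction_isolation');   -- e.g., 'serializable'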
See Section 13.2.3 for performance suggestions.
This level of integrity protection using Serializable transactions does not yet extend to hot standby mode (Section 27.4). Because of that, those using hot standby may want to use Repeatable Read and explicit locking on the primary.
When non-serializable writes are possible,
to ensure the current validity of a row and protect it against
concurrent updates one must use SELECT FOR UPDATE
,
SELECT FOR SHARE
, or an appropriate LOCK
TABLE
statement. (SELECT FOR UPDATE
and SELECT FOR SHARE
lock just the
returned rows against concurrent updates, while LOCK
TABLE
locks the whole table.) This should be taken into
account when porting applications to
PostgreSQL from other environments.
Also of note to those converting from other environments is the fact
that SELECT FOR UPDATE
does not ensure that a
concurrent transaction will not update or delete a selected row.
To do that in PostgreSQL you must actually
update the row, even if no values need to be changed.
SELECT FOR UPDATE
temporarily blocks
other transactions from acquiring the same lock or executing an
UPDATE
or DELETE
which would
affect the locked row, but once the transaction holding this lock
commits or rolls back, a blocked transaction will proceed with the
conflicting operation unless an actual UPDATE
of
the row was performed while the lock was held.
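A sketch of the dummy-update technique on the accounts table; the no-op assignment still creates a new row version, which is what makes a blocked concurrent writer behave as if the row had genuinely changed:

BEGIN;
-- locking alone (SELECT ... FOR UPDATE) would not be enough once this transaction commits
UPDATE accounts SET balance = balance WHERE acctnum = 12345;   -- no value change, but the row is rewritten
-- ... decisions based on this row's current contents ...
COMMIT;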
Global validity checks require extra thought under
non-serializable MVCC.
For example, a banking application might wish to check that the sum of
all credits in one table equals the sum of debits in another table,
when both tables are being actively updated. Comparing the results of two
successive SELECT sum(...)
commands will not work reliably in
Read Committed mode, since the second query will likely include the results
of transactions not counted by the first. Doing the two sums in a
single repeatable read transaction will give an accurate picture of only the
effects of transactions that committed before the repeatable read transaction
started — but one might legitimately wonder whether the answer is still
relevant by the time it is delivered. If the repeatable read transaction
itself applied some changes before trying to make the consistency check,
the usefulness of the check becomes even more debatable, since now it
includes some but not all post-transaction-start changes. In such cases
a careful person might wish to lock all tables needed for the check,
in order to get an indisputable picture of current reality. A
SHARE
mode (or higher) lock guarantees that there are no
uncommitted changes in the locked table, other than those of the current
transaction.
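For example, a sketch of such a check, assuming hypothetical credits and debits tables with an amount column:

BEGIN;
LOCK TABLE credits, debits IN SHARE MODE;   -- blocks concurrent writers, allows readers
SELECT sum(amount) AS total_credits FROM credits;
SELECT sum(amount) AS total_debits  FROM debits;
COMMIT;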
Note also that if one is relying on explicit locking to prevent concurrent
changes, one should either use Read Committed mode, or in Repeatable Read
mode be careful to obtain
locks before performing queries. A lock obtained by a
repeatable read transaction guarantees that no other transactions modifying
the table are still running, but if the snapshot seen by the
transaction predates obtaining the lock, it might predate some now-committed
changes in the table. A repeatable read transaction's snapshot is actually
frozen at the start of its first query or data-modification command
(SELECT
, INSERT
,
UPDATE
, or DELETE
), so
it is possible to obtain locks explicitly before the snapshot is
frozen.
Some DDL commands, currently only TRUNCATE
and the
table-rewriting forms of ALTER TABLE
, are not
MVCC-safe. This means that after the truncation or rewrite commits, the
table will appear empty to concurrent transactions, if they are using a
snapshot taken before the DDL command committed. This will only be an
issue for a transaction that did not access the table in question
before the DDL command started — any transaction that has done so
would hold at least an ACCESS SHARE
table lock,
which would block the DDL command until that transaction completes.
So these commands will not cause any apparent inconsistency in the
table contents for successive queries on the target table, but they
could cause visible inconsistency between the contents of the target
table and other tables in the database.
Support for the Serializable transaction isolation level has not yet been added to Hot Standby replication targets (described in Section 27.4). The strictest isolation level currently supported in hot standby mode is Repeatable Read. While performing all permanent database writes within Serializable transactions on the primary will ensure that all standbys will eventually reach a consistent state, a Repeatable Read transaction run on the standby can sometimes see a transient state that is inconsistent with any serial execution of the transactions on the primary.
Internal access to the system catalogs is not done using the isolation level of the current transaction. This means that newly created database objects such as tables are visible to concurrent Repeatable Read and Serializable transactions, even though the rows they contain are not. In contrast, queries that explicitly examine the system catalogs don't see rows representing concurrently created database objects, in the higher isolation levels.
Though PostgreSQL provides nonblocking read/write access to table data, nonblocking read/write access is not currently offered for every index access method implemented in PostgreSQL. The various index types are handled as follows:
B-tree and GiST indexes: Short-term share/exclusive page-level locks are used for read/write access. Locks are released immediately after each index row is fetched or inserted. These index types provide the highest concurrency without deadlock conditions.
Hash indexes: Share/exclusive hash-bucket-level locks are used for read/write access. Locks are released after the whole bucket is processed. Bucket-level locks provide better concurrency than index-level ones, but deadlock is possible since the locks are held longer than one index operation.
GIN indexes: Short-term share/exclusive page-level locks are used for read/write access. Locks are released immediately after each index row is fetched or inserted. But note that insertion of a GIN-indexed value usually produces several index key insertions per row, so GIN might do substantial work for a single value's insertion.
Currently, B-tree indexes offer the best performance for concurrent applications; since they also have more features than hash indexes, they are the recommended index type for concurrent applications that need to index scalar data. When dealing with non-scalar data, B-trees are not useful, and GiST, SP-GiST or GIN indexes should be used instead.
Table of Contents
Query performance can be affected by many things. Some of these can be controlled by the user, while others are fundamental to the underlying design of the system. This chapter provides some hints about understanding and tuning PostgreSQL performance.
EXPLAIN
PostgreSQL devises a query
plan for each query it receives. Choosing the right
plan to match the query structure and the properties of the data
is absolutely critical for good performance, so the system includes
a complex planner that tries to choose good plans.
You can use the EXPLAIN
command
to see what query plan the planner creates for any query.
Plan-reading is an art that requires some experience to master,
but this section attempts to cover the basics.
Examples in this section are drawn from the regression test database
after doing a VACUUM ANALYZE
, using 9.3 development sources.
You should be able to get similar results if you try the examples
yourself, but your estimated costs and row counts might vary slightly
because ANALYZE
's statistics are random samples rather
than exact, and because costs are inherently somewhat platform-dependent.
The examples use EXPLAIN
's default “text” output
format, which is compact and convenient for humans to read.
If you want to feed EXPLAIN
's output to a program for further
analysis, you should use one of its machine-readable output formats
(XML, JSON, or YAML) instead.
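For example, a machine-readable format is selected with EXPLAIN's FORMAT option:

EXPLAIN (FORMAT JSON) SELECT * FROM tenk1;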
EXPLAIN Basics
The structure of a query plan is a tree of plan nodes.
Nodes at the bottom level of the tree are scan nodes: they return raw rows
from a table. There are different types of scan nodes for different
table access methods: sequential scans, index scans, and bitmap index
scans. There are also non-table row sources, such as VALUES
clauses and set-returning functions in FROM
, which have their
own scan node types.
If the query requires joining, aggregation, sorting, or other
operations on the raw rows, then there will be additional nodes
above the scan nodes to perform these operations. Again,
there is usually more than one possible way to do these operations,
so different node types can appear here too. The output
of EXPLAIN
has one line for each node in the plan
tree, showing the basic node type plus the cost estimates that the planner
made for the execution of that plan node. Additional lines might appear,
indented from the node's summary line,
to show additional properties of the node.
The very first line (the summary line for the topmost
node) has the estimated total execution cost for the plan; it is this
number that the planner seeks to minimize.
Here is a trivial example, just to show what the output looks like:
EXPLAIN SELECT * FROM tenk1;

                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
Since this query has no WHERE
clause, it must scan all the
rows of the table, so the planner has chosen to use a simple sequential
scan plan. The numbers that are quoted in parentheses are (left
to right):
Estimated start-up cost. This is the time expended before the output phase can begin, e.g., time to do the sorting in a sort node.
Estimated total cost. This is stated on the assumption that the plan
node is run to completion, i.e., all available rows are retrieved.
In practice a node's parent node might stop short of reading all
available rows (see the LIMIT
example below).
Estimated number of rows output by this plan node. Again, the node is assumed to be run to completion.
Estimated average width of rows output by this plan node (in bytes).
The costs are measured in arbitrary units determined by the planner's
cost parameters (see Section 20.7.2).
Traditional practice is to measure the costs in units of disk page
fetches; that is, seq_page_cost is conventionally
set to 1.0
and the other cost parameters are set relative
to that. The examples in this section are run with the default cost
parameters.
It's important to understand that the cost of an upper-level node includes the cost of all its child nodes. It's also important to realize that the cost only reflects things that the planner cares about. In particular, the cost does not consider the time spent transmitting result rows to the client, which could be an important factor in the real elapsed time; but the planner ignores it because it cannot change it by altering the plan. (Every correct plan will output the same row set, we trust.)
The rows
value is a little tricky because it is
not the number of rows processed or scanned by the
plan node, but rather the number emitted by the node. This is often
less than the number scanned, as a result of filtering by any
WHERE
-clause conditions that are being applied at the node.
Ideally the top-level rows estimate will approximate the number of rows
actually returned, updated, or deleted by the query.
Returning to our example:
EXPLAIN SELECT * FROM tenk1;

                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
These numbers are derived very straightforwardly. If you do:
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
you will find that tenk1
has 358 disk
pages and 10000 rows. The estimated cost is computed as (disk pages read *
seq_page_cost) + (rows scanned *
cpu_tuple_cost). By default,
seq_page_cost
is 1.0 and cpu_tuple_cost
is 0.01,
so the estimated cost is (358 * 1.0) + (10000 * 0.01) = 458.
Now let's modify the query to add a WHERE
condition:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 7000;

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=7001 width=244)
   Filter: (unique1 < 7000)
Notice that the EXPLAIN
output shows the WHERE
clause being applied as a “filter” condition attached to the Seq
Scan plan node. This means that
the plan node checks the condition for each row it scans, and outputs
only the ones that pass the condition.
The estimate of output rows has been reduced because of the
WHERE
clause.
However, the scan will still have to visit all 10000 rows, so the cost
hasn't decreased; in fact it has gone up a bit (by 10000 * cpu_operator_cost, to be exact) to reflect the extra CPU
time spent checking the WHERE
condition.
The actual number of rows this query would select is 7000, but the rows
estimate is only approximate. If you try to duplicate this experiment,
you will probably get a slightly different estimate; moreover, it can
change after each ANALYZE
command, because the
statistics produced by ANALYZE
are taken from a
randomized sample of the table.
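To see this for yourself, you can re-collect statistics and repeat the EXPLAIN; the rows estimate may shift slightly each time (plan output omitted here):

ANALYZE tenk1;
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 7000;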
Now, let's make the condition more restrictive:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=5.07..229.20 rows=101 width=244)
   Recheck Cond: (unique1 < 100)
   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
         Index Cond: (unique1 < 100)
Here the planner has decided to use a two-step plan: the child plan node visits an index to find the locations of rows matching the index condition, and then the upper plan node actually fetches those rows from the table itself. Fetching rows separately is much more expensive than reading them sequentially, but because not all the pages of the table have to be visited, this is still cheaper than a sequential scan. (The reason for using two plan levels is that the upper plan node sorts the row locations identified by the index into physical order before reading them, to minimize the cost of separate fetches. The “bitmap” mentioned in the node names is the mechanism that does the sorting.)
Now let's add another condition to the WHERE
clause:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND stringu1 = 'xxx';

                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=5.04..229.43 rows=1 width=244)
   Recheck Cond: (unique1 < 100)
   Filter: (stringu1 = 'xxx'::name)
   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
         Index Cond: (unique1 < 100)
The added condition stringu1 = 'xxx'
reduces the
output row count estimate, but not the cost because we still have to visit
the same set of rows. Notice that the stringu1
clause
cannot be applied as an index condition, since this index is only on
the unique1
column. Instead it is applied as a filter on
the rows retrieved by the index. Thus the cost has actually gone up
slightly to reflect this extra checking.
In some cases the planner will prefer a “simple” index scan plan:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;

                                  QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244)
   Index Cond: (unique1 = 42)
In this type of plan the table rows are fetched in index order, which
makes them even more expensive to read, but there are so few that the
extra cost of sorting the row locations is not worth it. You'll most
often see this plan type for queries that fetch just a single row. It's
also often used for queries that have an ORDER BY
condition
that matches the index order, because then no extra sorting step is needed
to satisfy the ORDER BY. In this example, adding
ORDER BY unique1
would use the same plan because the
index already implicitly provides the requested ordering.
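For instance, this variant should produce the same index scan plan (output omitted):

EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42 ORDER BY unique1;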
The planner may implement an ORDER BY
clause in several
ways. The above example shows that such an ordering clause may be
implemented implicitly. The planner may also add an explicit
sort
step:
EXPLAIN SELECT * FROM tenk1 ORDER BY unique1;

                            QUERY PLAN
-------------------------------------------------------------------
 Sort  (cost=1109.39..1134.39 rows=10000 width=244)
   Sort Key: unique1
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
If a part of the plan guarantees an ordering on a prefix of the
required sort keys, then the planner may instead decide to use an
incremental sort
step:
EXPLAIN SELECT * FROM tenk1 ORDER BY four, ten LIMIT 100;

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=521.06..538.05 rows=100 width=244)
   ->  Incremental Sort  (cost=521.06..2220.95 rows=10000 width=244)
         Sort Key: four, ten
         Presorted Key: four
         ->  Index Scan using index_tenk1_on_four on tenk1  (cost=0.29..1510.08 rows=10000 width=244)
Compared to regular sorts, sorting incrementally allows returning tuples
before the entire result set has been sorted, which particularly enables
optimizations with LIMIT
queries. It may also reduce
memory usage and the likelihood of spilling sorts to disk, but it comes at
the cost of the increased overhead of splitting the result set into multiple
sorting batches.
If there are separate indexes on several of the columns referenced
in WHERE
, the planner might choose to use an AND or OR
combination of the indexes:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=25.08..60.21 rows=10 width=244)
   Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
   ->  BitmapAnd  (cost=25.08..25.08 rows=10 width=0)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
               Index Cond: (unique1 < 100)
         ->  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0)
               Index Cond: (unique2 > 9000)
But this requires visiting both indexes, so it's not necessarily a win compared to using just one index and treating the other condition as a filter. If you vary the ranges involved you'll see the plan change accordingly.
Here is an example showing the effects of LIMIT:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000 LIMIT 2;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit  (cost=0.29..14.48 rows=2 width=244)
   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..71.27 rows=10 width=244)
         Index Cond: (unique2 > 9000)
         Filter: (unique1 < 100)
This is the same query as above, but we added a LIMIT
so that
not all the rows need be retrieved, and the planner changed its mind about
what to do. Notice that the total cost and row count of the Index Scan
node are shown as if it were run to completion. However, the Limit node
is expected to stop after retrieving only a fifth of those rows, so its
total cost is only a fifth as much, and that's the actual estimated cost
of the query. This plan is preferred over adding a Limit node to the
previous plan because the Limit could not avoid paying the startup cost
of the bitmap scan, so the total cost would be something over 25 units
with that approach.
Let's try joining two tables, using the columns we have been discussing:
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=4.65..118.62 rows=10 width=488)
   ->  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244)
         Recheck Cond: (unique1 < 10)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: (unique1 < 10)
   ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.91 rows=1 width=244)
         Index Cond: (unique2 = t1.unique2)
In this plan, we have a nested-loop join node with two table scans as
inputs, or children. The indentation of the node summary lines reflects
the plan tree structure. The join's first, or “outer”, child
is a bitmap scan similar to those we saw before. Its cost and row count
are the same as we'd get from SELECT ... WHERE unique1 < 10
because we are
applying the WHERE
clause unique1 < 10
at that node.
The t1.unique2 = t2.unique2
clause is not relevant yet,
so it doesn't affect the row count of the outer scan. The nested-loop
join node will run its second,
or “inner” child once for each row obtained from the outer child.
Column values from the current outer row can be plugged into the inner
scan; here, the t1.unique2
value from the outer row is available,
so we get a plan and costs similar to what we saw above for a simple
SELECT ... WHERE t2.unique2 = constant case.
(The estimated cost is actually a bit lower than what was seen above,
as a result of caching that's expected to occur during the repeated
index scans on t2.) The
costs of the loop node are then set on the basis of the cost of the outer
scan, plus one repetition of the inner scan for each outer row (10 * 7.91,
here), plus a little CPU time for join processing.
In this example the join's output row count is the same as the product
of the two scans' row counts, but that's not true in all cases because
there can be additional WHERE
clauses that mention both tables
and so can only be applied at the join point, not to either input scan.
Here's an example:
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 10 AND t2.unique2 < 10 AND t1.hundred < t2.hundred;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Nested Loop  (cost=4.65..49.46 rows=33 width=488)
   Join Filter: (t1.hundred < t2.hundred)
   ->  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244)
         Recheck Cond: (unique1 < 10)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: (unique1 < 10)
   ->  Materialize  (cost=0.29..8.51 rows=10 width=244)
         ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..8.46 rows=10 width=244)
               Index Cond: (unique2 < 10)
The condition t1.hundred < t2.hundred
can't be
tested in the tenk2_unique2
index, so it's applied at the
join node. This reduces the estimated output row count of the join node,
but does not change either input scan.
Notice that here the planner has chosen to “materialize” the inner
relation of the join, by putting a Materialize plan node atop it. This
means that the t2
index scan will be done just once, even
though the nested-loop join node needs to read that data ten times, once
for each row from the outer relation. The Materialize node saves the data
in memory as it's read, and then returns the data from memory on each
subsequent pass.
When dealing with outer joins, you might see join plan nodes with both
“Join Filter” and plain “Filter” conditions attached.
Join Filter conditions come from the outer join's ON
clause,
so a row that fails the Join Filter condition could still get emitted as
a null-extended row. But a plain Filter condition is applied after the
outer-join rules and so acts to remove rows unconditionally. In an inner
join there is no semantic difference between these types of filters.
If we change the query's selectivity a bit, we might get a very different join plan:
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Hash Join  (cost=230.47..713.98 rows=101 width=488)
   Hash Cond: (t2.unique2 = t1.unique2)
   ->  Seq Scan on tenk2 t2  (cost=0.00..445.00 rows=10000 width=244)
   ->  Hash  (cost=229.20..229.20 rows=101 width=244)
         ->  Bitmap Heap Scan on tenk1 t1  (cost=5.07..229.20 rows=101 width=244)
               Recheck Cond: (unique1 < 100)
               ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
                     Index Cond: (unique1 < 100)
Here, the planner has chosen to use a hash join, in which rows of one
table are entered into an in-memory hash table, after which the other
table is scanned and the hash table is probed for matches to each row.
Again note how the indentation reflects the plan structure: the bitmap
scan on tenk1
is the input to the Hash node, which constructs
the hash table. That's then returned to the Hash Join node, which reads
rows from its outer child plan and searches the hash table for each one.
Another possible type of join is a merge join, illustrated here:
EXPLAIN SELECT * FROM tenk1 t1, onek t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Merge Join  (cost=198.11..268.19 rows=10 width=488)
   Merge Cond: (t1.unique2 = t2.unique2)
   ->  Index Scan using tenk1_unique2 on tenk1 t1  (cost=0.29..656.28 rows=101 width=244)
         Filter: (unique1 < 100)
   ->  Sort  (cost=197.83..200.33 rows=1000 width=244)
         Sort Key: t2.unique2
         ->  Seq Scan on onek t2  (cost=0.00..148.00 rows=1000 width=244)
Merge join requires its input data to be sorted on the join keys. In this
plan the tenk1
data is sorted by using an index scan to visit
the rows in the correct order, but a sequential scan and sort is preferred
for onek
, because there are many more rows to be visited in
that table.
(Sequential-scan-and-sort frequently beats an index scan for sorting many rows,
because of the nonsequential disk access required by the index scan.)
One way to look at variant plans is to force the planner to disregard
whatever strategy it thought was the cheapest, using the enable/disable
flags described in Section 20.7.1.
(This is a crude tool, but useful. See
also Section 14.3.)
For example, if we're unconvinced that sequential-scan-and-sort is the best way to
deal with table onek
in the previous example, we could try
SET enable_sort = off;

EXPLAIN SELECT * FROM tenk1 t1, onek t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Merge Join  (cost=0.56..292.65 rows=10 width=488)
   Merge Cond: (t1.unique2 = t2.unique2)
   ->  Index Scan using tenk1_unique2 on tenk1 t1  (cost=0.29..656.28 rows=101 width=244)
         Filter: (unique1 < 100)
   ->  Index Scan using onek_unique2 on onek t2  (cost=0.28..224.79 rows=1000 width=244)
which shows that the planner thinks that sorting onek
by
index-scanning is about 12% more expensive than sequential-scan-and-sort.
Of course, the next question is whether it's right about that.
We can investigate that using EXPLAIN ANALYZE
, as discussed
below.
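Incidentally, when you are done with such experiments, remember to restore the planner's default behavior; for example:

RESET enable_sort;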
EXPLAIN ANALYZE
It is possible to check the accuracy of the planner's estimates
by using EXPLAIN
's ANALYZE
option. With this
option, EXPLAIN
actually executes the query, and then displays
the true row counts and true run time accumulated within each plan node,
along with the same estimates that a plain EXPLAIN
shows. For example, we might get a result like this:
EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1)
   ->  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1)
         Recheck Cond: (unique1 < 10)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1)
               Index Cond: (unique1 < 10)
   ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10)
         Index Cond: (unique2 = t1.unique2)
 Planning time: 0.181 ms
 Execution time: 0.501 ms
Note that the “actual time” values are in milliseconds of
real time, whereas the cost
estimates are expressed in
arbitrary units; so they are unlikely to match up.
The thing that's usually most important to look for is whether the
estimated row counts are reasonably close to reality. In this example
the estimates were all dead-on, but that's quite unusual in practice.
In some query plans, it is possible for a subplan node to be executed more
than once. For example, the inner index scan will be executed once per
outer row in the above nested-loop plan. In such cases, the
loops
value reports the
total number of executions of the node, and the actual time and rows
values shown are averages per-execution. This is done to make the numbers
comparable with the way that the cost estimates are shown. Multiply by
the loops
value to get the total time actually spent in
the node. In the above example, we spent a total of 0.220 milliseconds
executing the index scans on tenk2
.
In some cases EXPLAIN ANALYZE
shows additional execution
statistics beyond the plan node execution times and row counts.
For example, Sort and Hash nodes provide extra information:
EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2 ORDER BY t1.fivethous;

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=717.34..717.59 rows=101 width=488) (actual time=7.761..7.774 rows=100 loops=1)
   Sort Key: t1.fivethous
   Sort Method: quicksort  Memory: 77kB
   ->  Hash Join  (cost=230.47..713.98 rows=101 width=488) (actual time=0.711..7.427 rows=100 loops=1)
         Hash Cond: (t2.unique2 = t1.unique2)
         ->  Seq Scan on tenk2 t2  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.007..2.583 rows=10000 loops=1)
         ->  Hash  (cost=229.20..229.20 rows=101 width=244) (actual time=0.659..0.659 rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 28kB
               ->  Bitmap Heap Scan on tenk1 t1  (cost=5.07..229.20 rows=101 width=244) (actual time=0.080..0.526 rows=100 loops=1)
                     Recheck Cond: (unique1 < 100)
                     ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0) (actual time=0.049..0.049 rows=100 loops=1)
                           Index Cond: (unique1 < 100)
 Planning time: 0.194 ms
 Execution time: 8.008 ms
The Sort node shows the sort method used (in particular, whether the sort was in-memory or on-disk) and the amount of memory or disk space needed. The Hash node shows the number of hash buckets and batches as well as the peak amount of memory used for the hash table. (If the number of batches exceeds one, there will also be disk space usage involved, but that is not shown.)
Another type of extra information is the number of rows removed by a filter condition:
EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE ten < 7;

                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=7000 width=244) (actual time=0.016..5.107 rows=7000 loops=1)
   Filter: (ten < 7)
   Rows Removed by Filter: 3000
 Planning time: 0.083 ms
 Execution time: 5.905 ms
These counts can be particularly valuable for filter conditions applied at join nodes. The “Rows Removed” line only appears when at least one scanned row, or potential join pair in the case of a join node, is rejected by the filter condition.
A case similar to filter conditions occurs with “lossy” index scans. For example, consider this search for polygons containing a specific point:
EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on polygon_tbl  (cost=0.00..1.05 rows=1 width=32) (actual time=0.044..0.044 rows=0 loops=1)
   Filter: (f1 @> '((0.5,2))'::polygon)
   Rows Removed by Filter: 4
 Planning time: 0.040 ms
 Execution time: 0.083 ms
The planner thinks (quite correctly) that this sample table is too small to bother with an index scan, so we have a plain sequential scan in which all the rows got rejected by the filter condition. But if we force an index scan to be used, we see:
SET enable_seqscan TO off;

EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Index Scan using gpolygonind on polygon_tbl  (cost=0.13..8.15 rows=1 width=32) (actual time=0.062..0.062 rows=0 loops=1)
   Index Cond: (f1 @> '((0.5,2))'::polygon)
   Rows Removed by Index Recheck: 1
 Planning time: 0.034 ms
 Execution time: 0.144 ms
Here we can see that the index returned one candidate row, which was then rejected by a recheck of the index condition. This happens because a GiST index is “lossy” for polygon containment tests: it actually returns the rows with polygons that overlap the target, and then we have to do the exact containment test on those rows.
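As with the enable_sort experiment earlier, you would normally undo this override once you are done, for example:

RESET enable_seqscan;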
EXPLAIN
has a BUFFERS
option that can be used with
ANALYZE
to get even more run time statistics:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=25.08..60.21 rows=10 width=244) (actual time=0.323..0.342 rows=10 loops=1)
   Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
   Buffers: shared hit=15
   ->  BitmapAnd  (cost=25.08..25.08 rows=10 width=0) (actual time=0.309..0.309 rows=0 loops=1)
         Buffers: shared hit=7
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0) (actual time=0.043..0.043 rows=100 loops=1)
               Index Cond: (unique1 < 100)
               Buffers: shared hit=2
         ->  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0) (actual time=0.227..0.227 rows=999 loops=1)
               Index Cond: (unique2 > 9000)
               Buffers: shared hit=5
 Planning time: 0.088 ms
 Execution time: 0.423 ms
The numbers provided by BUFFERS
help to identify which parts
of the query are the most I/O-intensive.
Keep in mind that because EXPLAIN ANALYZE
actually
runs the query, any side-effects will happen as usual, even though
whatever results the query might output are discarded in favor of
printing the EXPLAIN
data. If you want to analyze a
data-modifying query without changing your tables, you can
roll the command back afterwards, for example:
BEGIN;

EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 < 100;

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Update on tenk1  (cost=5.08..230.08 rows=0 width=0) (actual time=3.791..3.792 rows=0 loops=1)
   ->  Bitmap Heap Scan on tenk1  (cost=5.08..230.08 rows=102 width=10) (actual time=0.069..0.513 rows=100 loops=1)
         Recheck Cond: (unique1 < 100)
         Heap Blocks: exact=90
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.05 rows=102 width=0) (actual time=0.036..0.037 rows=300 loops=1)
               Index Cond: (unique1 < 100)
 Planning Time: 0.113 ms
 Execution Time: 3.850 ms

ROLLBACK;
As seen in this example, when the query is an INSERT
,
UPDATE
, or DELETE
command, the actual work of
applying the table changes is done by a top-level Insert, Update,
or Delete plan node. The plan nodes underneath this node perform
the work of locating the old rows and/or computing the new data.
So above, we see the same sort of bitmap table scan we've seen already,
and its output is fed to an Update node that stores the updated rows.
It's worth noting that although the data-modifying node can take a
considerable amount of run time (here, it's consuming the lion's share
of the time), the planner does not currently add anything to the cost
estimates to account for that work. That's because the work to be done is
the same for every correct query plan, so it doesn't affect planning
decisions.
When an UPDATE
or DELETE
command affects an
inheritance hierarchy, the output might look like this:
EXPLAIN UPDATE parent SET f2 = f2 + 1 WHERE f1 = 101;

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Update on parent  (cost=0.00..24.59 rows=0 width=0)
   Update on parent parent_1
   Update on child1 parent_2
   Update on child2 parent_3
   Update on child3 parent_4
   ->  Result  (cost=0.00..24.59 rows=4 width=14)
         ->  Append  (cost=0.00..24.54 rows=4 width=14)
               ->  Seq Scan on parent parent_1  (cost=0.00..0.00 rows=1 width=14)
                     Filter: (f1 = 101)
               ->  Index Scan using child1_pkey on child1 parent_2  (cost=0.15..8.17 rows=1 width=14)
                     Index Cond: (f1 = 101)
               ->  Index Scan using child2_pkey on child2 parent_3  (cost=0.15..8.17 rows=1 width=14)
                     Index Cond: (f1 = 101)
               ->  Index Scan using child3_pkey on child3 parent_4  (cost=0.15..8.17 rows=1 width=14)
                     Index Cond: (f1 = 101)
In this example the Update node needs to consider three child tables as well as the originally-mentioned parent table. So there are four input scanning subplans, one per table. For clarity, the Update node is annotated to show the specific target tables that will be updated, in the same order as the corresponding subplans.
The Planning time
shown by EXPLAIN
ANALYZE
is the time it took to generate the query plan from the
parsed query and optimize it. It does not include parsing or rewriting.
The Execution time
shown by EXPLAIN
ANALYZE
includes executor start-up and shut-down time, as well
as the time to run any triggers that are fired, but it does not include
parsing, rewriting, or planning time.
Time spent executing BEFORE
triggers, if any, is included in
the time for the related Insert, Update, or Delete node; but time
spent executing AFTER
triggers is not counted there because
AFTER
triggers are fired after completion of the whole plan.
The total time spent in each trigger
(either BEFORE
or AFTER
) is also shown separately.
Note that deferred constraint triggers will not be executed
until end of transaction and are thus not considered at all by
EXPLAIN ANALYZE
.
There are two significant ways in which run times measured by
EXPLAIN ANALYZE
can deviate from normal execution of
the same query. First, since no output rows are delivered to the client,
network transmission costs and I/O conversion costs are not included.
Second, the measurement overhead added by EXPLAIN
ANALYZE
can be significant, especially on machines with slow
gettimeofday()
operating-system calls. You can use the
pg_test_timing tool to measure the overhead of timing
on your system.
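As a sketch, the tool is invoked from the shell; the duration option shown here is optional:

pg_test_timing --duration=5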
EXPLAIN
results should not be extrapolated to situations
much different from the one you are actually testing; for example,
results on a toy-sized table cannot be assumed to apply to large tables.
The planner's cost estimates are not linear and so it might choose
a different plan for a larger or smaller table. An extreme example
is that on a table that only occupies one disk page, you'll nearly
always get a sequential scan plan whether indexes are available or not.
The planner realizes that it's going to take one disk page read to
process the table in any case, so there's no value in expending additional
page reads to look at an index. (We saw this happening in the
polygon_tbl
example above.)
There are cases in which the actual and estimated values won't match up
well, but nothing is really wrong. One such case occurs when
plan node execution is stopped short by a LIMIT
or similar
effect. For example, in the LIMIT
query we used before,
EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000 LIMIT 2;

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..14.71 rows=2 width=244) (actual time=0.177..0.249 rows=2 loops=1)
   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..72.42 rows=10 width=244) (actual time=0.174..0.244 rows=2 loops=1)
         Index Cond: (unique2 > 9000)
         Filter: (unique1 < 100)
         Rows Removed by Filter: 287
 Planning time: 0.096 ms
 Execution time: 0.336 ms
the estimated cost and row count for the Index Scan node are shown as though it were run to completion. But in reality the Limit node stopped requesting rows after it got two, so the actual row count is only 2 and the run time is less than the cost estimate would suggest. This is not an estimation error, only a discrepancy in the way the estimates and true values are displayed.
Merge joins also have measurement artifacts that can confuse the unwary.
A merge join will stop reading one input if it's exhausted the other input
and the next key value in the one input is greater than the last key value
of the other input; in such a case there can be no more matches and so no
need to scan the rest of the first input. This results in not reading all
of one child, with results like those mentioned for LIMIT
.
Also, if the outer (first) child contains rows with duplicate key values,
the inner (second) child is backed up and rescanned for the portion of its
rows matching that key value. EXPLAIN ANALYZE
counts these
repeated emissions of the same inner rows as if they were real additional
rows. When there are many outer duplicates, the reported actual row count
for the inner child plan node can be significantly larger than the number
of rows that are actually in the inner relation.
BitmapAnd and BitmapOr nodes always report their actual row counts as zero, due to implementation limitations.
Normally, EXPLAIN
will display every plan node
created by the planner. However, there are cases where the executor
can determine that certain nodes need not be executed because they
cannot produce any rows, based on parameter values that were not
available at planning time. (Currently this can only happen for child
nodes of an Append or MergeAppend node that is scanning a partitioned
table.) When this happens, those plan nodes are omitted from
the EXPLAIN output and a Subplans Removed: N annotation appears
instead.
As we saw in the previous section, the query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans. This section provides a quick look at the statistics that the system uses for these estimates.
One component of the statistics is the total number of entries in
each table and index, as well as the number of disk blocks occupied
by each table and index. This information is kept in the table
pg_class
,
in the columns reltuples
and
relpages
. We can look at it with
queries similar to this one:
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'tenk1%';

       relname        | relkind | reltuples | relpages
----------------------+---------+-----------+----------
 tenk1                | r       |     10000 |      358
 tenk1_hundred        | i       |     10000 |       30
 tenk1_thous_tenthous | i       |     10000 |       30
 tenk1_unique1        | i       |     10000 |       30
 tenk1_unique2        | i       |     10000 |       30
(5 rows)
Here we can see that tenk1
contains 10000
rows, as do its indexes, but the indexes are (unsurprisingly) much
smaller than the table.
For efficiency reasons, reltuples
and relpages
are not updated on-the-fly,
and so they usually contain somewhat out-of-date values.
They are updated by VACUUM
, ANALYZE
, and a
few DDL commands such as CREATE INDEX
. A VACUUM
or ANALYZE
operation that does not scan the entire table
(which is commonly the case) will incrementally update the
reltuples
count on the basis of the part
of the table it did scan, resulting in an approximate value.
In any case, the planner
will scale the values it finds in pg_class
to match the current physical table size, thus obtaining a closer
approximation.
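As an illustrative sketch (not part of the official examples), you can compare the stored page count against the table's current physical size, using the block size reported by current_setting:

SELECT relpages,
       pg_relation_size('tenk1') / current_setting('block_size')::int AS current_pages
FROM pg_class
WHERE relname = 'tenk1';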
Most queries retrieve only a fraction of the rows in a table, due
to WHERE
clauses that restrict the rows to be
examined. The planner thus needs to make an estimate of the
selectivity of WHERE
clauses, that is,
the fraction of rows that match each condition in the
WHERE
clause. The information used for this task is
stored in the
pg_statistic
system catalog. Entries in pg_statistic
are updated by the ANALYZE
and VACUUM
ANALYZE
commands, and are always approximate even when freshly
updated.
Rather than look at pg_statistic
directly,
it's better to look at its view
pg_stats
when examining the statistics manually. pg_stats
is designed to be more easily readable. Furthermore,
pg_stats
is readable by all, whereas
pg_statistic
is only readable by a superuser.
(This prevents unprivileged users from learning something about
the contents of other people's tables from the statistics. The
pg_stats
view is restricted to show only
rows about tables that the current user can read.)
For example, we might do:
SELECT attname, inherited, n_distinct,
       array_to_string(most_common_vals, E'\n') as most_common_vals
FROM pg_stats
WHERE tablename = 'road';

 attname | inherited | n_distinct |   most_common_vals
---------+-----------+------------+----------------------
 name    | f         |  -0.363388 | I- 580 Ramp+
         |           |            | I- 880 Ramp+
         |           |            | Sp Railroad +
         |           |            | I- 580 +
         |           |            | I- 680 Ramp
 name    | t         |  -0.284859 | I- 880 Ramp+
         |           |            | I- 580 Ramp+
         |           |            | I- 680 Ramp+
         |           |            | I- 580 +
         |           |            | State Hwy 13 Ramp
(2 rows)
Note that two rows are displayed for the same column, one corresponding
to the complete inheritance hierarchy starting at the
road table (inherited = t),
and another one including only the road table itself
(inherited = f).
The amount of information stored in pg_statistic
by ANALYZE
, in particular the maximum number of entries in the
most_common_vals
and histogram_bounds
arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS
command, or globally by setting the
default_statistics_target configuration variable.
The default limit is presently 100 entries. Raising the limit
might allow more accurate planner estimates to be made, particularly for
columns with irregular data distributions, at the price of consuming
more space in pg_statistic
and slightly more
time to compute the estimates. Conversely, a lower limit might be
sufficient for columns with simple data distributions.
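For example, one could raise the statistics target for a single column of the tenk1 example table and then re-collect statistics (the target value here is only illustrative):

ALTER TABLE tenk1 ALTER COLUMN hundred SET STATISTICS 500;
ANALYZE tenk1;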
Further details about the planner's use of statistics can be found in Chapter 72.
It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated. Regular statistics, because of their per-individual-column nature, cannot capture any knowledge about cross-column correlation. However, PostgreSQL has the ability to compute multivariate statistics, which can capture such information.
Because the number of possible column combinations is very large, it's impractical to compute multivariate statistics automatically. Instead, extended statistics objects, more often called just statistics objects, can be created to instruct the server to obtain statistics across interesting sets of columns.
Statistics objects are created using the
CREATE STATISTICS
command.
Creation of such an object merely creates a catalog entry expressing
interest in the statistics. Actual data collection is performed
by ANALYZE
(either a manual command, or background
auto-analyze). The collected values can be examined in the
pg_statistic_ext_data
catalog.
ANALYZE
computes extended statistics based on the same
sample of table rows that it takes for computing regular single-column
statistics. Since the sample size is increased by increasing the
statistics target for the table or any of its columns (as described in
the previous section), a larger statistics target will normally result in
more accurate extended statistics, as well as more time spent calculating
them.
The following subsections describe the kinds of extended statistics that are currently supported.
The simplest kind of extended statistics tracks functional
dependencies, a concept used in definitions of database normal forms.
We say that column b
is functionally dependent on
column a
if knowledge of the value of
a
is sufficient to determine the value
of b, that is, there are no two rows having the same value
of a but different values of b.
In a fully normalized database, functional dependencies should exist
only on primary keys and superkeys. However, in practice many data sets
are not fully normalized for various reasons; intentional
denormalization for performance reasons is a common example.
Even in a fully normalized database, there may be partial correlation
between some columns, which can be expressed as partial functional
dependency.
The existence of functional dependencies directly affects the accuracy of estimates in certain queries. If a query contains conditions on both the independent and the dependent column(s), the conditions on the dependent columns do not further reduce the result size; but without knowledge of the functional dependency, the query planner will assume that the conditions are independent, resulting in underestimating the result size.
To inform the planner about functional dependencies, ANALYZE
can collect measurements of cross-column dependency. Assessing the
degree of dependency between all sets of columns would be prohibitively
expensive, so data collection is limited to those groups of columns
appearing together in a statistics object defined with
the dependencies
option. It is advisable to create
dependencies
statistics only for column groups that are
strongly correlated, to avoid unnecessary overhead in both
ANALYZE
and later query planning.
Here is an example of collecting functional-dependency statistics:
CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;

ANALYZE zipcodes;

SELECT stxname, stxkeys, stxddependencies
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts';
 stxname | stxkeys |             stxddependencies
---------+---------+------------------------------------------
 stts    | 1 5     | {"1 => 5": 1.000000, "5 => 1": 0.423130}
(1 row)
Here it can be seen that column 1 (zip code) fully determines column 5 (city) so the coefficient is 1.0, while city only determines zip code about 42% of the time, meaning that there are many cities (58%) that are represented by more than a single ZIP code.
When computing the selectivity for a query involving functionally dependent columns, the planner adjusts the per-condition selectivity estimates using the dependency coefficients so as not to produce an underestimate.
Functional dependencies are currently only applied when considering
simple equality conditions that compare columns to constant values,
and IN
clauses with constant values.
They are not used to improve estimates for equality conditions
comparing two columns or comparing a column to an expression, nor for
range clauses, LIKE
or any other type of condition.
When estimating with functional dependencies, the planner assumes that conditions on the involved columns are compatible and hence redundant. If they are incompatible, the correct estimate would be zero rows, but that possibility is not considered. For example, given a query like
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
the planner will disregard the city
clause as not
changing the selectivity, which is correct. However, it will make
the same assumption about
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
even though there will really be zero rows satisfying this query. Functional dependency statistics do not provide enough information to conclude that, however.
In many practical situations, this assumption is usually satisfied; for example, there might be a GUI in the application that only allows selecting compatible city and ZIP code values to use in a query. But if that's not the case, functional dependencies may not be a viable option.
Single-column statistics store the number of distinct values in each
column. Estimates of the number of distinct values when combining more
than one column (for example, for GROUP BY a, b
) are
frequently wrong when the planner only has single-column statistical
data, causing it to select bad plans.
To improve such estimates, ANALYZE
can collect n-distinct
statistics for groups of columns. As before, it's impractical to do
this for every possible column grouping, so data is collected only for
those groups of columns appearing together in a statistics object
defined with the ndistinct
option. Data will be collected
for each possible combination of two or more columns from the set of
listed columns.
Continuing the previous example, the n-distinct counts in a table of ZIP codes might look like the following:
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;

ANALYZE zipcodes;

SELECT stxkeys AS k, stxdndistinct AS nd
  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
  WHERE stxname = 'stts2';
-[ RECORD 1 ]--------------------------------------------------------
k  | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
This indicates that there are three combinations of columns that have 33178 distinct values: ZIP code and state; ZIP code and city; and ZIP code, city and state (the fact that they are all equal is expected given that ZIP code alone is unique in this table). On the other hand, the combination of city and state has only 27435 distinct values.
It's advisable to create ndistinct
statistics objects only
on combinations of columns that are actually used for grouping, and
for which misestimation of the number of groups is resulting in bad
plans. Otherwise, the ANALYZE
cycles are just wasted.
Another type of statistic stored for each column is the most-common value (MCV) list. This allows very accurate estimates for individual columns, but may result in significant misestimates for queries with conditions on multiple columns.
To improve such estimates, ANALYZE
can collect MCV
lists on combinations of columns. Similarly to functional dependencies
and n-distinct coefficients, it's impractical to do this for every
possible column grouping. Even more so in this case, as the MCV list
(unlike functional dependencies and n-distinct coefficients) does store
the common column values. So data is collected only for those groups
of columns appearing together in a statistics object defined with the
mcv
option.
Continuing the previous example, the MCV list for a table of ZIP codes might look like the following (unlike for simpler types of statistics, a function is required for inspection of MCV contents):
CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;

ANALYZE zipcodes;

SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';

 index |        values        | nulls | frequency | base_frequency
-------+----------------------+-------+-----------+----------------
     0 | {Washington, DC}     | {f,f} |  0.003467 |        2.7e-05
     1 | {Apo, AE}            | {f,f} |  0.003067 |        1.9e-05
     2 | {Houston, TX}        | {f,f} |  0.002167 |       0.000133
     3 | {El Paso, TX}        | {f,f} |     0.002 |       0.000113
     4 | {New York, NY}       | {f,f} |  0.001967 |       0.000114
     5 | {Atlanta, GA}        | {f,f} |  0.001633 |        3.3e-05
     6 | {Sacramento, CA}     | {f,f} |  0.001433 |        7.8e-05
     7 | {Miami, FL}          | {f,f} |    0.0014 |          6e-05
     8 | {Dallas, TX}         | {f,f} |  0.001367 |        8.8e-05
     9 | {Chicago, IL}        | {f,f} |  0.001333 |        5.1e-05
   ...
(99 rows)
This indicates that the most common combination of city and state is Washington in DC, with actual frequency (in the sample) about 0.35%. The base frequency of the combination (as computed from the simple per-column frequencies) is only 0.0027%, resulting in two orders of magnitude under-estimates.
It's advisable to create MCV statistics objects only
on combinations of columns that are actually used in conditions together,
and for which misestimation of the number of groups is resulting in bad
plans. Otherwise, the ANALYZE
and planning cycles
are just wasted.
Controlling the Planner with Explicit JOIN Clauses
It is possible
to control the query planner to some extent by using the explicit JOIN
syntax. To see why this matters, we first need some background.
In a simple join query, such as:
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
the planner is free to join the given tables in any order. For
example, it could generate a query plan that joins A to B, using
the WHERE
condition a.id = b.id
, and then
joins C to this joined table, using the other WHERE
condition. Or it could join B to C and then join A to that result.
Or it could join A to C and then join them with B — but that
would be inefficient, since the full Cartesian product of A and C
would have to be formed, there being no applicable condition in the
WHERE
clause to allow optimization of the join. (All
joins in the PostgreSQL executor happen
between two input tables, so it's necessary to build up the result
in one or another of these fashions.) The important point is that
these different join possibilities give semantically equivalent
results but might have hugely different execution costs. Therefore,
the planner will explore all of them to try to find the most
efficient query plan.
When a query only involves two or three tables, there aren't many join orders to worry about. But the number of possible join orders grows exponentially as the number of tables expands. Beyond ten or so input tables it's no longer practical to do an exhaustive search of all the possibilities, and even for six or seven tables planning might take an annoyingly long time. When there are too many input tables, the PostgreSQL planner will switch from exhaustive search to a genetic probabilistic search through a limited number of possibilities. (The switch-over threshold is set by the geqo_threshold run-time parameter.) The genetic search takes less time, but it won't necessarily find the best possible plan.
When the query involves outer joins, the planner has less freedom than it does for plain (inner) joins. For example, consider:
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
Although this query's restrictions are superficially similar to the previous example, the semantics are different because a row must be emitted for each row of A that has no matching row in the join of B and C. Therefore the planner has no choice of join order here: it must join B to C and then join A to that result. Accordingly, this query takes less time to plan than the previous query. In other cases, the planner might be able to determine that more than one join order is safe. For example, given:
SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);
it is valid to join A to either B or C first. Currently, only
FULL JOIN
completely constrains the join order. Most
practical cases involving LEFT JOIN
or RIGHT JOIN
can be rearranged to some extent.
Explicit inner join syntax (INNER JOIN
, CROSS
JOIN
, or unadorned JOIN
) is semantically the same as
listing the input relations in FROM
, so it does not
constrain the join order.
Even though most kinds of JOIN
don't completely constrain
the join order, it is possible to instruct the
PostgreSQL query planner to treat all
JOIN
clauses as constraining the join order anyway.
For example, these three queries are logically equivalent:
SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;

SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;

SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
But if we tell the planner to honor the JOIN
order,
the second and third take less time to plan than the first. This effect
is not worth worrying about for only three tables, but it can be a
lifesaver with many tables.
To force the planner to follow the join order laid out by explicit
JOIN
s,
set the join_collapse_limit run-time parameter to 1.
(Other possible values are discussed below.)
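As a minimal sketch, using the tables from the examples above:

SET join_collapse_limit = 1;
-- the planner will now join b to c first, then join a to that result
SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);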
You do not need to constrain the join order completely in order to
cut search time, because it's OK to use JOIN
operators
within items of a plain FROM
list. For example, consider:
SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
With join_collapse_limit
= 1, this
forces the planner to join A to B before joining them to other tables,
but doesn't constrain its choices otherwise. In this example, the
number of possible join orders is reduced by a factor of 5.
Constraining the planner's search in this way is a useful technique
both for reducing planning time and for directing the planner to a
good query plan. If the planner chooses a bad join order by default,
you can force it to choose a better order via JOIN
syntax
— assuming that you know of a better order, that is. Experimentation
is recommended.
A closely related issue that affects planning time is collapsing of subqueries into their parent query. For example, consider:
SELECT * FROM x, y, (SELECT * FROM a, b, c WHERE something) AS ss WHERE somethingelse;
This situation might arise from use of a view that contains a join;
the view's SELECT
rule will be inserted in place of the view
reference, yielding a query much like the above. Normally, the planner
will try to collapse the subquery into the parent, yielding:
SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
This usually results in a better plan than planning the subquery
separately. (For example, the outer WHERE
conditions might be such that
joining X to A first eliminates many rows of A, thus avoiding the need to
form the full logical output of the subquery.) But at the same time,
we have increased the planning time; here, we have a five-way join
problem replacing two separate three-way join problems. Because of the
exponential growth of the number of possibilities, this makes a big
difference. The planner tries to avoid getting stuck in huge join search
problems by not collapsing a subquery if more than from_collapse_limit
FROM
items would result in the parent
query. You can trade off planning time against quality of plan by
adjusting this run-time parameter up or down.
from_collapse_limit and join_collapse_limit
are similarly named because they do almost the same thing: one controls
when the planner will “flatten out” subqueries, and the
other controls when it will flatten out explicit joins. Typically
you would either set join_collapse_limit
equal to
from_collapse_limit
(so that explicit joins and subqueries
act similarly) or set join_collapse_limit
to 1 (if you want
to control join order with explicit joins). But you might set them
differently if you are trying to fine-tune the trade-off between planning
time and run time.
One might need to insert a large amount of data when first populating a database. This section contains some suggestions on how to make this process as efficient as possible.
When using multiple INSERT
s, turn off autocommit and just do
one commit at the end. (In plain
SQL, this means issuing BEGIN
at the start and
COMMIT
at the end. Some client libraries might
do this behind your back, in which case you need to make sure the
library does it when you want it done.) If you allow each
insertion to be committed separately,
PostgreSQL is doing a lot of work for
each row that is added. An additional benefit of doing all
insertions in one transaction is that if the insertion of one row
were to fail then the insertion of all rows inserted up to that
point would be rolled back, so you won't be stuck with partially
loaded data.
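A minimal sketch of this pattern, using a hypothetical table mytable:

BEGIN;
INSERT INTO mytable VALUES (1, 'one');
INSERT INTO mytable VALUES (2, 'two');
-- ... many more INSERTs ...
COMMIT;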
Use COPY
Use COPY
to load
all the rows in one command, instead of using a series of
INSERT
commands. The COPY
command is optimized for loading large numbers of rows; it is less
flexible than INSERT
, but incurs significantly
less overhead for large data loads. Since COPY
is a single command, there is no need to disable autocommit if you
use this method to populate a table.
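For example, a server-side load from a CSV file might look like this (the table and file path are hypothetical; psql's \copy variant reads the file on the client side instead):

COPY mytable FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);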
If you cannot use COPY
, it might help to use PREPARE
to create a
prepared INSERT
statement, and then use
EXECUTE
as many times as required. This avoids
some of the overhead of repeatedly parsing and planning
INSERT
. Different interfaces provide this facility
in different ways; look for “prepared statements” in the interface
documentation.
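In plain SQL the pattern looks roughly like this (the table name and parameter types are hypothetical):

PREPARE bulk_insert (integer, text) AS
    INSERT INTO mytable VALUES ($1, $2);
EXECUTE bulk_insert(1, 'one');
EXECUTE bulk_insert(2, 'two');
DEALLOCATE bulk_insert;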
Note that loading a large number of rows using
COPY
is almost always faster than using
INSERT
, even if PREPARE
is used and
multiple insertions are batched into a single transaction.
COPY
is fastest when used within the same
transaction as an earlier CREATE TABLE
or
TRUNCATE
command. In such cases no WAL
needs to be written, because in case of an error, the files
containing the newly loaded data will be removed anyway.
However, this consideration only applies when
wal_level is minimal
as all commands must write WAL otherwise.
If you are loading a freshly created table, the fastest method is to
create the table, bulk load the table's data using
COPY
, then create any indexes needed for the
table. Creating an index on pre-existing data is quicker than
updating it incrementally as each row is loaded.
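A sketch of that ordering, with hypothetical names:

CREATE TABLE mytable (id integer, val text);
COPY mytable FROM '/path/to/data.csv' WITH (FORMAT csv);
CREATE INDEX mytable_id_idx ON mytable (id);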
If you are adding large amounts of data to an existing table, it might be a win to drop the indexes, load the table, and then recreate the indexes. Of course, the database performance for other users might suffer during the time the indexes are missing. One should also think twice before dropping a unique index, since the error checking afforded by the unique constraint will be lost while the index is missing.
Just as with indexes, a foreign key constraint can be checked “in bulk” more efficiently than row-by-row. So it might be useful to drop foreign key constraints, load data, and re-create the constraints. Again, there is a trade-off between data load speed and loss of error checking while the constraint is missing.
What's more, when you load data into a table with existing foreign key constraints, each new row requires an entry in the server's list of pending trigger events (since it is the firing of a trigger that checks the row's foreign key constraint). Loading many millions of rows can cause the trigger event queue to overflow available memory, leading to intolerable swapping or even outright failure of the command. Therefore it may be necessary, not just desirable, to drop and re-apply foreign keys when loading large amounts of data. If temporarily removing the constraint isn't acceptable, the only other recourse may be to split up the load operation into smaller transactions.
Increase maintenance_work_mem
Temporarily increasing the maintenance_work_mem
configuration variable when loading large amounts of data can
lead to improved performance. This will help to speed up CREATE
INDEX
commands and ALTER TABLE ADD FOREIGN KEY
commands.
It won't do much for COPY
itself, so this advice is
only useful when you are using one or both of the above techniques.
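For example, you might raise it just for the session doing the load (the value shown is only illustrative; choose one that fits your available memory):

SET maintenance_work_mem = '1GB';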
Increase max_wal_size
Temporarily increasing the max_wal_size
configuration variable can also
make large data loads faster. This is because loading a large
amount of data into PostgreSQL will
cause checkpoints to occur more often than the normal checkpoint
frequency (specified by the checkpoint_timeout
configuration variable). Whenever a checkpoint occurs, all dirty
pages must be flushed to disk. By increasing
max_wal_size
temporarily during bulk
data loads, the number of checkpoints that are required can be
reduced.
When loading large amounts of data into an installation that uses
WAL archiving or streaming replication, it might be faster to take a
new base backup after the load has completed than to process a large
amount of incremental WAL data. To prevent incremental WAL logging
while loading, disable archiving and streaming replication, by setting
wal_level to minimal
,
archive_mode to off
, and
max_wal_senders to zero.
But note that changing these settings requires a server restart,
and makes any base backups taken beforehand unusable for archive
recovery and standby servers, which may lead to data loss.
Aside from avoiding the time for the archiver or WAL sender to process the
WAL data, doing this will actually make certain commands faster, because
they do not write WAL at all if wal_level
is minimal
and the current subtransaction (or top-level
transaction) created or truncated the table or index they change. (They
can guarantee crash safety more cheaply by doing
an fsync
at the end than by writing WAL.)
Run ANALYZE Afterwards
Whenever you have significantly altered the distribution of data
within a table, running ANALYZE
is strongly recommended. This
includes bulk loading large amounts of data into the table. Running
ANALYZE
(or VACUUM ANALYZE
)
ensures that the planner has up-to-date statistics about the
table. With no statistics or obsolete statistics, the planner might
make poor decisions during query planning, leading to poor
performance on any tables with inaccurate or nonexistent
statistics. Note that if the autovacuum daemon is enabled, it might
run ANALYZE
automatically; see
Section 25.1.3
and Section 25.1.6 for more information.
Dump scripts generated by pg_dump automatically apply several, but not all, of the above guidelines. To restore a pg_dump dump as quickly as possible, you need to do a few extra things manually. (Note that these points apply while restoring a dump, not while creating it. The same points apply whether loading a text dump with psql or using pg_restore to load from a pg_dump archive file.)
By default, pg_dump uses COPY
, and when
it is generating a complete schema-and-data dump, it is careful to
load data before creating indexes and foreign keys. So in this case
several guidelines are handled automatically. What is left
for you to do is to:
Set appropriate (i.e., larger than normal) values for
maintenance_work_mem
and
max_wal_size
.
If using WAL archiving or streaming replication, consider disabling
them during the restore. To do that, set archive_mode
to off
,
wal_level
to minimal
, and
max_wal_senders
to zero before loading the dump.
Afterwards, set them back to the right values and take a fresh
base backup.
Experiment with the parallel dump and restore modes of both
pg_dump and pg_restore and find the
optimal number of concurrent jobs to use. Dumping and restoring in
parallel by means of the -j
option should give you a
significantly higher performance over the serial mode.
Consider whether the whole dump should be restored as a single
transaction. To do that, pass the -1 or --single-transaction command-line option to psql or pg_restore. When using this mode, even the smallest of errors will roll back the entire restore, possibly discarding many hours of processing. Depending on how interrelated the data is, that might seem preferable to manual cleanup, or not. COPY
commands will run fastest if you use a single
transaction and have WAL archiving turned off.
If multiple CPUs are available in the database server, consider using
pg_restore's --jobs
option. This
allows concurrent data loading and index creation.
Run ANALYZE
afterwards.
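As a sketch of the first point above (raising maintenance_work_mem for the restore), one option is a database-level setting, so that every session the restore opens inherits it; the database name and value here are hypothetical:
-- Hypothetical: sessions connecting to "mydb" will use a larger maintenance_work_mem,
-- which speeds up the index and foreign-key builds performed during the restore.
ALTER DATABASE mydb SET maintenance_work_mem = '2GB';
-- After the restore completes, remove the override again:
ALTER DATABASE mydb RESET maintenance_work_mem;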
A data-only dump will still use COPY, but it does not
drop or recreate indexes, and it does not normally touch foreign
keys.
[14]
So when loading a data-only dump, it is up to you to drop and recreate
indexes and foreign keys if you wish to use those techniques.
It's still useful to increase max_wal_size
while loading the data, but don't bother increasing
maintenance_work_mem
; rather, you'd do that while
manually recreating indexes and foreign keys afterwards.
And don't forget to ANALYZE
when you're done; see
Section 25.1.3
and Section 25.1.6 for more information.
Durability is a database feature that guarantees the recording of committed transactions even if the server crashes or loses power. However, durability adds significant database overhead, so if your site does not require such a guarantee, PostgreSQL can be configured to run much faster. The following are configuration changes you can make to improve performance in such cases. Except as noted below, durability is still guaranteed in case of a crash of the database software; only an abrupt operating system crash creates a risk of data loss or corruption when these settings are used.
Place the database cluster's data directory in a memory-backed file system (i.e., RAM disk). This eliminates all database disk I/O, but limits data storage to the amount of available memory (and perhaps swap).
Turn off fsync; there is no need to flush data to disk.
Turn off synchronous_commit; there might be no need to force WAL writes to disk on every commit. This setting does risk transaction loss (though not data corruption) in case of a crash of the database.
Turn off full_page_writes; there is no need to guard against partial page writes.
Increase max_wal_size and checkpoint_timeout; this reduces the frequency
of checkpoints, but increases the storage requirements of
/pg_wal
.
Create unlogged tables to avoid WAL writes, though it makes the tables non-crash-safe.
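As an illustration, several of the settings listed above can be applied with ALTER SYSTEM; this is only a sketch (the table name is hypothetical), and the durability caveats described in the text apply in full:
-- Sketch only: trade durability for speed; do not use where data loss is unacceptable.
ALTER SYSTEM SET fsync = off;
ALTER SYSTEM SET synchronous_commit = off;
ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();   -- these three can be changed without a restart
-- Unlogged tables skip WAL entirely, but are emptied after a crash.
CREATE UNLOGGED TABLE scratch_data (id bigint, payload text);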
[14]
You can get the effect of disabling foreign keys by using
the --disable-triggers
option — but realize that
that eliminates, rather than just postpones, foreign key
validation, and so it is possible to insert bad data if you use it.
PostgreSQL can devise query plans that can leverage multiple CPUs in order to answer queries faster. This feature is known as parallel query. Many queries cannot benefit from parallel query, either due to limitations of the current implementation or because there is no imaginable query plan that is any faster than the serial query plan. However, for queries that can benefit, the speedup from parallel query is often very significant. Many queries can run more than twice as fast when using parallel query, and some queries can run four times faster or even more. Queries that touch a large amount of data but return only a few rows to the user will typically benefit most. This chapter explains some details of how parallel query works and in which situations it can be used so that users who wish to make use of it can understand what to expect.
When the optimizer determines that parallel query is the fastest execution strategy for a particular query, it will create a query plan that includes a Gather or Gather Merge node. Here is a simple example:
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather  (cost=1000.00..217018.43 rows=1 width=97)
   Workers Planned: 2
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..216018.33 rows=1 width=97)
         Filter: (filler ~~ '%x%'::text)
(4 rows)
In all cases, the Gather
or
Gather Merge
node will have exactly one
child plan, which is the portion of the plan that will be executed in
parallel. If the Gather
or Gather Merge
node is
at the very top of the plan tree, then the entire query will execute in
parallel. If it is somewhere else in the plan tree, then only the portion
of the plan below it will run in parallel. In the example above, the
query accesses only one table, so there is only one plan node other than
the Gather
node itself; since that plan node is a child of the
Gather
node, it will run in parallel.
Using EXPLAIN, you can see the number of
workers chosen by the planner. When the Gather
node is reached
during query execution, the process that is implementing the user's
session will request a number of background
worker processes equal to the number
of workers chosen by the planner. The number of background workers that
the planner will consider using is limited to at most
max_parallel_workers_per_gather. The total number
of background workers that can exist at any one time is limited by both
max_worker_processes and
max_parallel_workers. Therefore, it is possible for a
parallel query to run with fewer workers than planned, or even with
no workers at all. The optimal plan may depend on the number of workers
that are available, so this can result in poor query performance. If this
occurrence is frequent, consider increasing
max_worker_processes
and max_parallel_workers
so that more workers can be run simultaneously or alternatively reducing
max_parallel_workers_per_gather
so that the planner
requests fewer workers.
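For example, the limits involved can be inspected and adjusted as follows; the values are illustrative, and note that max_worker_processes itself can only be changed with a server restart:
-- Inspect the current limits on parallel workers.
SHOW max_parallel_workers_per_gather;
SHOW max_parallel_workers;
SHOW max_worker_processes;
-- Illustrative adjustment: enlarge the pool of parallel workers cluster-wide ...
ALTER SYSTEM SET max_parallel_workers = 8;
SELECT pg_reload_conf();
-- ... or reduce the per-Gather request for just the current session.
SET max_parallel_workers_per_gather = 2;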
Every background worker process that is successfully started for a given
parallel query will execute the parallel portion of the plan. The leader
will also execute that portion of the plan, but it has an additional
responsibility: it must also read all of the tuples generated by the
workers. When the parallel portion of the plan generates only a small
number of tuples, the leader will often behave very much like an additional
worker, speeding up query execution. Conversely, when the parallel portion
of the plan generates a large number of tuples, the leader may be almost
entirely occupied with reading the tuples generated by the workers and
performing any further processing steps that are required by plan nodes
above the level of the Gather
node or
Gather Merge
node. In such cases, the leader will
do very little of the work of executing the parallel portion of the plan.
When the node at the top of the parallel portion of the plan is
Gather Merge
rather than Gather
, it indicates that
each process executing the parallel portion of the plan is producing
tuples in sorted order, and that the leader is performing an
order-preserving merge. In contrast, Gather
reads tuples
from the workers in whatever order is convenient, destroying any sort
order that may have existed.
There are several settings that can cause the query planner not to generate a parallel query plan under any circumstances. In order for any parallel query plans whatsoever to be generated, the following settings must be configured as indicated.
max_parallel_workers_per_gather must be set to a
value that is greater than zero. This is a special case of the more
general principle that no more workers should be used than the number
configured via max_parallel_workers_per_gather.
In addition, the system must not be running in single-user mode. Since the entire database system is running as a single process in this situation, no background workers will be available.
Even when it is in general possible for parallel query plans to be generated, the planner will not generate them for a given query if any of the following are true:
The query writes any data or locks any database rows. If a query
contains a data-modifying operation either at the top level or within
a CTE, no parallel plans for that query will be generated. As an
exception, the following commands, which create a new table and populate
it, can use a parallel plan for the underlying SELECT
part of the query:
CREATE TABLE ... AS
SELECT INTO
CREATE MATERIALIZED VIEW
REFRESH MATERIALIZED VIEW
The query might be suspended during execution. In any situation in
which the system thinks that partial or incremental execution might
occur, no parallel plan is generated. For example, a cursor created
using DECLARE CURSOR will never use
a parallel plan. Similarly, a PL/pgSQL loop of the form
FOR x IN query LOOP .. END LOOP
will never use a
parallel plan, because the parallel query system is unable to verify
that the code in the loop is safe to execute while parallel query is
active.
The query uses any function marked PARALLEL UNSAFE. Most system-defined functions are PARALLEL SAFE, but user-defined functions are marked PARALLEL UNSAFE by default; see the discussion of Section 15.4. (A quick way to check how a function is marked is shown just after this list.)
The query is running inside of another query that is already parallel. For example, if a function called by a parallel query issues an SQL query itself, that query will never use a parallel plan. This is a limitation of the current implementation, but it may not be desirable to remove this limitation, since it could result in a single query using a very large number of processes.
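As mentioned in the list above, a function's parallel-safety marking can be inspected in the system catalogs; for example (the function name is hypothetical):
-- proparallel is 's' (safe), 'r' (restricted), or 'u' (unsafe).
SELECT proname, proparallel
FROM pg_proc
WHERE proname = 'my_function';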
Even when a parallel query plan is generated for a particular query, there
are several circumstances under which it will be impossible to execute
that plan in parallel at execution time. If this occurs, the leader
will execute the portion of the plan below the Gather
node entirely by itself, almost as if the Gather
node were
not present. This will happen if any of the following conditions are met:
No background workers can be obtained because of the limitation that the total number of background workers cannot exceed max_worker_processes.
No background workers can be obtained because of the limitation that the total number of background workers launched for purposes of parallel query cannot exceed max_parallel_workers.
The client sends an Execute message with a non-zero fetch count. See the discussion of the extended query protocol. Since libpq currently provides no way to send such a message, this can only occur when using a client that does not rely on libpq. If this is a frequent occurrence, it may be a good idea to set max_parallel_workers_per_gather to zero in sessions where it is likely, so as to avoid generating query plans that may be suboptimal when run serially.
Because each worker executes the parallel portion of the plan to completion, it is not possible to simply take an ordinary query plan and run it using multiple workers. Each worker would produce a full copy of the output result set, so the query would not run any faster than normal but would produce incorrect results. Instead, the parallel portion of the plan must be what is known internally to the query optimizer as a partial plan; that is, it must be constructed so that each process that executes the plan will generate only a subset of the output rows in such a way that each required output row is guaranteed to be generated by exactly one of the cooperating processes. Generally, this means that the scan on the driving table of the query must be a parallel-aware scan.
The following types of parallel-aware table scans are currently supported.
In a parallel sequential scan, the table's blocks will be divided into ranges and shared among the cooperating processes. Each worker process will complete the scanning of its given range of blocks before requesting an additional range of blocks.
In a parallel bitmap heap scan, one process is chosen as the leader. That process performs a scan of one or more indexes and builds a bitmap indicating which table blocks need to be visited. These blocks are then divided among the cooperating processes as in a parallel sequential scan. In other words, the heap scan is performed in parallel, but the underlying index scan is not.
In a parallel index scan or parallel index-only scan, the cooperating processes take turns reading data from the index. Currently, parallel index scans are supported only for btree indexes. Each process will claim a single index block and will scan and return all tuples referenced by that block; other processes can at the same time be returning tuples from a different index block. The results of a parallel btree scan are returned in sorted order within each worker process.
Other scan types, such as scans of non-btree indexes, may support parallel scans in the future.
Just as in a non-parallel plan, the driving table may be joined to one or more other tables using a nested loop, hash join, or merge join. The inner side of the join may be any kind of non-parallel plan that is otherwise supported by the planner provided that it is safe to run within a parallel worker. Depending on the join type, the inner side may also be a parallel plan.
In a nested loop join, the inner side is always non-parallel. Although it is executed in full, this is efficient if the inner side is an index scan, because the outer tuples and thus the loops that look up values in the index are divided over the cooperating processes.
In a merge join, the inner side is always a non-parallel plan and therefore executed in full. This may be inefficient, especially if a sort must be performed, because the work and resulting data are duplicated in every cooperating process.
In a hash join (without the "parallel" prefix), the inner side is executed in full by every cooperating process to build identical copies of the hash table. This may be inefficient if the hash table is large or the plan is expensive. In a parallel hash join, the inner side is a parallel hash that divides the work of building a shared hash table over the cooperating processes.
PostgreSQL supports parallel aggregation by aggregating in
two stages. First, each process participating in the parallel portion of
the query performs an aggregation step, producing a partial result for
each group of which that process is aware. This is reflected in the plan
as a Partial Aggregate
node. Second, the partial results are
transferred to the leader via Gather or Gather Merge. Finally, the leader re-aggregates the results across all
workers in order to produce the final result. This is reflected in the
plan as a Finalize Aggregate
node.
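A parallel aggregate plan therefore typically has the following shape; this is an illustrative sketch against a hypothetical table, not verbatim EXPLAIN output:
EXPLAIN SELECT count(*) FROM measurements;   -- hypothetical table
-- Illustrative plan shape (costs omitted):
--  Finalize Aggregate
--    ->  Gather
--          Workers Planned: 2
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on measurements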
Because the Finalize Aggregate
node runs on the leader
process, queries that produce a relatively large number of groups in
comparison to the number of input rows will appear less favorable to the
query planner. For example, in the worst-case scenario the number of
groups seen by the Finalize Aggregate
node could be as many as
the number of input rows that were seen by all worker processes in the
Partial Aggregate
stage. For such cases, there is clearly
going to be no performance benefit to using parallel aggregation. The
query planner takes this into account during the planning process and is
unlikely to choose parallel aggregate in this scenario.
Parallel aggregation is not supported in all situations. Each aggregate
must be safe for parallelism and must
have a combine function. If the aggregate has a transition state of type internal, it must have serialization and deserialization
functions. See CREATE AGGREGATE for more details.
Parallel aggregation is not supported if any aggregate function call
contains a DISTINCT or ORDER BY clause, and it is also not supported for ordered-set aggregates or when the query involves GROUPING SETS. It can only be used when all joins involved in
the query are also part of the parallel portion of the plan.
Whenever PostgreSQL needs to combine rows
from multiple sources into a single result set, it uses an
Append
or MergeAppend
plan node.
This commonly happens when implementing UNION ALL
or
when scanning a partitioned table. Such nodes can be used in parallel
plans just as they can in any other plan. However, in a parallel plan,
the planner may instead use a Parallel Append
node.
When an Append
node is used in a parallel plan, each
process will execute the child plans in the order in which they appear,
so that all participating processes cooperate to execute the first child
plan until it is complete and then move to the second plan at around the
same time. When a Parallel Append
is used instead, the
executor will instead spread out the participating processes as evenly as
possible across its child plans, so that multiple child plans are executed
simultaneously. This avoids contention, and also avoids paying the startup
cost of a child plan in those processes that never execute it.
Also, unlike a regular Append
node, which can only have
partial children when used within a parallel plan, a Parallel
Append
node can have both partial and non-partial child plans.
Non-partial children will be scanned by only a single process, since
scanning them more than once would produce duplicate results. Plans that
involve appending multiple result sets can therefore achieve
coarse-grained parallelism even when efficient partial plans are not
available. For example, consider a query against a partitioned table
that can only be implemented efficiently by using an index that does
not support parallel scans. The planner might choose a Parallel
Append
of regular Index Scan
plans; each
individual index scan would have to be executed to completion by a single
process, but different scans could be performed at the same time by
different processes.
enable_parallel_append can be used to disable this feature.
If a query that is expected to do so does not produce a parallel plan, you can try reducing parallel_setup_cost or parallel_tuple_cost. Of course, this plan may turn out to be slower than the serial plan that the planner preferred, but this will not always be the case. If you don't get a parallel plan even with very small values of these settings (e.g., after setting them both to zero), there may be some reason why the query planner is unable to generate a parallel plan for your query. See Section 15.2 and Section 15.4 for information on why this may be the case.
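For example, in a test session you might try something like the following; setting the costs to zero is purely for experimentation, not a production recommendation:
-- Encourage parallel plans for the current session only.
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';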
When executing a parallel plan, you can use EXPLAIN (ANALYZE,
VERBOSE)
to display per-worker statistics for each plan node.
This may be useful in determining whether the work is being evenly
distributed between all plan nodes and more generally in understanding the
performance characteristics of the plan.
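For example, reusing the query shown earlier in this chapter:
-- Per-worker row counts and timings appear under each parallel plan node.
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';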
The planner classifies operations involved in a query as either
parallel safe, parallel restricted,
or parallel unsafe. A parallel safe operation is one that
does not conflict with the use of parallel query. A parallel restricted
operation is one that cannot be performed in a parallel worker, but that
can be performed in the leader while parallel query is in use. Therefore,
parallel restricted operations can never occur below a Gather
or Gather Merge
node, but can occur elsewhere in a plan that
contains such a node. A parallel unsafe operation is one that cannot
be performed while parallel query is in use, not even in the leader.
When a query contains anything that is parallel unsafe, parallel query
is completely disabled for that query.
The following operations are always parallel restricted:
Scans of common table expressions (CTEs).
Scans of temporary tables.
Scans of foreign tables, unless the foreign data wrapper has
an IsForeignScanParallelSafe
API that indicates otherwise.
Plan nodes to which an InitPlan
is attached.
Plan nodes that reference a correlated SubPlan
.
The planner cannot automatically determine whether a user-defined
function or aggregate is parallel safe, parallel restricted, or parallel
unsafe, because this would require predicting every operation that the
function could possibly perform. In general, this is equivalent to the
Halting Problem and therefore impossible. Even for simple functions
where it could conceivably be done, we do not try, since this would be expensive
and error-prone. Instead, all user-defined functions are assumed to
be parallel unsafe unless otherwise marked. When using
CREATE FUNCTION or
ALTER FUNCTION, markings can be set by specifying
PARALLEL SAFE, PARALLEL RESTRICTED, or PARALLEL UNSAFE as appropriate. When using CREATE AGGREGATE, the PARALLEL option can be specified with SAFE, RESTRICTED, or UNSAFE as the corresponding value.
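For example, a minimal sketch using hypothetical function names:
-- Declare the marking when creating a function:
CREATE FUNCTION add_one(i integer) RETURNS integer
    AS 'SELECT i + 1' LANGUAGE SQL
    PARALLEL SAFE;
-- Or change the marking of an existing function:
ALTER FUNCTION fetch_session_state() PARALLEL RESTRICTED;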
Functions and aggregates must be marked PARALLEL UNSAFE
if
they write to the database, access sequences, change the transaction state
even temporarily (e.g., a PL/pgSQL function that establishes an
EXCEPTION
block to catch errors), or make persistent changes to
settings. Similarly, functions must be marked PARALLEL
RESTRICTED
if they access temporary tables, client connection state,
cursors, prepared statements, or miscellaneous backend-local state that
the system cannot synchronize across workers. For example,
setseed
and random
are parallel restricted for
this last reason.
In general, if a function is labeled as being safe when it is restricted or
unsafe, or if it is labeled as being restricted when it is in fact unsafe,
it may throw errors or produce wrong answers when used in a parallel query.
C-language functions could in theory exhibit totally undefined behavior if
mislabeled, since there is no way for the system to protect itself against
arbitrary C code, but in most likely cases the result will be no worse than
for any other function. If in doubt, it is probably best to label functions
as UNSAFE.
If a function executed within a parallel worker acquires locks that are
not held by the leader, for example by querying a table not referenced in
the query, those locks will be released at worker exit, not end of
transaction. If you write a function that does this, and this behavior
difference is important to you, mark such functions as
PARALLEL RESTRICTED
to ensure that they execute only in the leader.
Note that the query planner does not consider deferring the evaluation of
parallel-restricted functions or aggregates involved in the query in
order to obtain a superior plan. So, for example, if a WHERE
clause applied to a particular table is parallel restricted, the query
planner will not consider performing a scan of that table in the parallel
portion of a plan. In some cases, it would be
possible (and perhaps even efficient) to include the scan of that table in
the parallel portion of the query and defer the evaluation of the
WHERE
clause so that it happens above the Gather
node. However, the planner does not do this.
This part covers topics that are of interest to a PostgreSQL database administrator. This includes installation of the software, set up and configuration of the server, management of users and databases, and maintenance tasks. Anyone who runs a PostgreSQL server, even for personal use, but especially in production, should be familiar with the topics covered in this part.
The information in this part is arranged approximately in the order in which a new user should read it. But the chapters are self-contained and can be read individually as desired. The information in this part is presented in a narrative fashion in topical units. Readers looking for a complete description of a particular command should see Part VI.
The first few chapters are written so they can be understood without prerequisite knowledge, so new users who need to set up their own server can begin their exploration with this part. The rest of this part is about tuning and management; that material assumes that the reader is familiar with the general use of the PostgreSQL database system. Readers are encouraged to look at Part I and Part II for additional information.
PostgreSQL is available in the form of binary packages for most common operating systems today. When available, this is the recommended way to install PostgreSQL for users of the system. Building from source (see Chapter 17) is only recommended for people developing PostgreSQL or extensions.
For an updated list of platforms providing binary packages, please visit the download section on the PostgreSQL website at https://www.postgresql.org/download/ and follow the instructions for the specific platform.
This chapter describes the installation of PostgreSQL using the source code distribution. If you are installing a pre-packaged distribution, such as an RPM or Debian package, ignore this chapter and see Chapter 16 instead.
If you are building PostgreSQL for Microsoft Windows, read this chapter if you intend to build with MinGW or Cygwin; but if you intend to build with Microsoft's Visual C++, see Chapter 18 instead.
./configure make su make install adduser postgres mkdir /usr/local/pgsql/data chown postgres /usr/local/pgsql/data su - postgres /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start /usr/local/pgsql/bin/createdb test /usr/local/pgsql/bin/psql test
The long version is the rest of this chapter.
In general, a modern Unix-compatible platform should be able to run PostgreSQL. The platforms that had received specific testing at the time of release are described in Section 17.6 below.
The following software packages are required for building PostgreSQL:
GNU make version 3.80 or newer is required; other
make programs or older GNU make versions will not work.
(GNU make is sometimes installed under
the name gmake.) To test for GNU
make enter:
make --version
You need an ISO/ANSI C compiler (at least C99-compliant). Recent versions of GCC are recommended, but PostgreSQL is known to build using a wide variety of compilers from different vendors.
tar is required to unpack the source distribution, in addition to either gzip or bzip2.
The GNU Readline library is used by
default. It allows psql (the
PostgreSQL command line SQL interpreter) to remember each
command you type, and allows you to use arrow keys to recall and
edit previous commands. This is very helpful and is strongly
recommended. If you don't want to use it then you must specify the --without-readline option to configure. As an alternative, you can often use the BSD-licensed libedit library, originally developed on NetBSD. The libedit library is GNU Readline-compatible and is used if libreadline is not found, or if --with-libedit-preferred is used as an option to configure. If you are using a package-based Linux distribution, be aware that you need both the readline and readline-devel packages, if those are separate in your distribution.
The zlib compression library is
used by default. If you don't want to use it then you must specify the --without-zlib option to configure. Using this option disables support for compressed archives in pg_dump and pg_restore.
The following packages are optional. They are not required in the default configuration, but they are needed when certain build options are enabled, as explained below:
To build the server programming language
PL/Perl you need a full
Perl installation, including the
libperl
library and the header files.
The minimum required version is Perl 5.8.3.
Since PL/Perl will be a shared
library, the
libperl
library must be a shared library
also on most platforms. This appears to be the default in
recent Perl versions, but it was not
in earlier versions, and in any case it is the choice of whoever installed Perl at your site. configure will fail if building PL/Perl is selected but it cannot find a shared libperl. In that case, you will have
to rebuild and install Perl manually to be
able to build PL/Perl. During the
configuration process for Perl, request a
shared library.
If you intend to make more than incidental use of
PL/Perl, you should ensure that the
Perl installation was built with the
usemultiplicity
option enabled (perl -V
will show whether this is the case).
To build the PL/Python server programming language, you need a Python installation with the header files and the sysconfig module. The minimum required version is Python 2.7. Python 3 is supported if it's version 3.2 or later; but see Section 46.1 when using Python 3.
Since PL/Python will be a shared
library, the
libpython
library must be a shared library
also on most platforms. This is not the case in a default
Python installation built from source, but a
shared library is available in many operating system
distributions. configure
will fail if
building PL/Python is selected but it cannot
find a shared libpython. That might mean that you
either have to install additional packages or rebuild (part of) your
Python installation to provide this shared
library. When building from source, run Python's
configure with the --enable-shared
flag.
To build the PL/Tcl procedural language, you of course need a Tcl installation. The minimum required version is Tcl 8.4.
To enable Native Language Support (NLS), that is, the ability to display a program's messages in a language other than English, you need an implementation of the Gettext API. Some operating systems have this built-in (e.g., Linux, NetBSD, Solaris), for other systems you can download an add-on package from https://www.gnu.org/software/gettext/. If you are using the Gettext implementation in the GNU C library then you will additionally need the GNU Gettext package for some utility programs. For any of the other implementations you will not need it.
You need OpenSSL, if you want to support
encrypted client connections. OpenSSL is
also required for random number generation on platforms that do not
have /dev/urandom
(except Windows). The minimum
version required is 1.0.1.
You need Kerberos, OpenLDAP, and/or PAM, if you want to support authentication using those services.
You need LZ4, if you want to support compression of data with this method; see default_toast_compression.
To build the PostgreSQL documentation, there is a separate set of requirements; see Section J.2.
If you are building from a Git tree instead of using a released source package, or if you want to do server development, you also need the following packages:
Flex and Bison are needed to build from a Git checkout, or if you changed the actual scanner and parser definition files. If you need them, be sure to get Flex 2.5.31 or later and Bison 1.875 or later. Other lex and yacc programs cannot be used.
Perl 5.8.3 or later is needed to build from a Git checkout, or if you changed the input files for any of the build steps that use Perl scripts. If building on Windows you will need Perl in any case. Perl is also required to run some test suites.
If you need to get a GNU package, you can find it at your local GNU mirror site (see https://www.gnu.org/prep/ftp for a list) or at ftp://ftp.gnu.org/gnu/.
Also check that you have sufficient disk space. You will need about
350 MB for the source tree during compilation and about 60 MB for
the installation directory. An empty database cluster takes about
40 MB; databases take about five times the amount of space that a
flat text file with the same data would take. If you are going to
run the regression tests you will temporarily need up to an extra
300 MB. Use the df
command to check free disk
space.
The PostgreSQL 14.13 sources can be obtained from the
download section of our
website: https://www.postgresql.org/download/. You
should get a file named postgresql-14.13.tar.gz or postgresql-14.13.tar.bz2. After
you have obtained the file, unpack it:
gunzip postgresql-14.13.tar.gz
tar xf postgresql-14.13.tar
(Use bunzip2
instead of gunzip
if
you have the .bz2
file. Also, note that most
modern versions of tar
can unpack compressed archives
directly, so you don't really need the
separate gunzip
or bunzip2
step.)
This will create a directory
postgresql-14.13
under the current directory
with the PostgreSQL sources.
Change into that directory for the rest
of the installation procedure.
You can also get the source directly from the version control repository, see Appendix I.
Configuration
The first step of the installation procedure is to configure the
source tree for your system and choose the options you would like.
This is done by running the configure
script. For a
default installation simply enter:
./configure
This script will run a number of tests to determine values for various system dependent variables and detect any quirks of your operating system, and finally will create several files in the build tree to record what it found.
You can also run configure
in a directory outside
the source tree, and then build there, if you want to keep the build
directory separate from the original source files. This procedure is
called a
VPATH
build. Here's how:
mkdir build_dir
cd build_dir
/path/to/source/tree/configure [options go here]
make
The default configuration will build the server and utilities, as
well as all client applications and interfaces that require only a
C compiler. All files will be installed under
/usr/local/pgsql
by default.
You can customize the build and installation process by supplying one
or more command line options to configure
.
Typically you would customize the install location, or the set of
optional features that are built. configure
has a large number of options, which are described in
Section 17.4.1.
Also, configure
responds to certain environment
variables, as described in Section 17.4.2.
These provide additional ways to customize the configuration.
Build
To start the build, type either of:
make
make all
(Remember to use GNU make.) The build will take a few minutes depending on your hardware.
If you want to build everything that can be built, including the
documentation (HTML and man pages), and the additional modules
(contrib
), type instead:
make world
If you want to build everything that can be built, including the
additional modules (contrib
), but without
the documentation, type instead:
make world-bin
If you want to invoke the build from another makefile rather than
manually, you must unset MAKELEVEL
or set it to zero,
for instance like this:
build-postgresql:
	$(MAKE) -C postgresql MAKELEVEL=0 all
Failure to do that can lead to strange error messages, typically about missing header files.
Regression Tests
If you want to test the newly built server before you install it, you can run the regression tests at this point. The regression tests are a test suite to verify that PostgreSQL runs on your machine in the way the developers expected it to. Type:
make check
(This won't work as root; do it as an unprivileged user.) See Chapter 33 for detailed information about interpreting the test results. You can repeat this test at any later time by issuing the same command.
Installing the Files
If you are upgrading an existing system be sure to read Section 19.6, which has instructions about upgrading a cluster.
To install PostgreSQL enter:
make install
This will install files into the directories that were specified in Step 1. Make sure that you have appropriate permissions to write into that area. Normally you need to do this step as root. Alternatively, you can create the target directories in advance and arrange for appropriate permissions to be granted.
To install the documentation (HTML and man pages), enter:
make install-docs
If you built the world above, type instead:
make install-world
This also installs the documentation.
If you built the world without the documentation above, type instead:
make install-world-bin
You can use make install-strip
instead of
make install
to strip the executable files and
libraries as they are installed. This will save some space. If
you built with debugging support, stripping will effectively
remove the debugging support, so it should only be done if
debugging is no longer needed. install-strip
tries to do a reasonable job saving space, but it does not have
perfect knowledge of how to strip every unneeded byte from an
executable file, so if you want to save all the disk space you
possibly can, you will have to do manual work.
The standard installation provides all the header files needed for client application development as well as for server-side program development, such as custom functions or data types written in C.
Client-only installation: If you want to install only the client applications and interface libraries, then you can use these commands:
make -C src/bin install
make -C src/include install
make -C src/interfaces install
make -C doc install
src/bin
has a few binaries for server-only use,
but they are small.
Uninstallation:
To undo the installation use the command make
uninstall
. However, this will not remove any created directories.
Cleaning:
After the installation you can free disk space by removing the built
files from the source tree with the command make
clean
. This will preserve the files made by the configure
program, so that you can rebuild everything with make
later on. To reset the source tree to the state in which it was
distributed, use make distclean
. If you are going to
build for several platforms within the same source tree you must do
this and re-configure for each platform. (Alternatively, use
a separate build tree for each platform, so that the source tree
remains unmodified.)
If you perform a build and then discover that your configure
options were wrong, or if you change anything that configure
investigates (for example, software upgrades), then it's a good
idea to do make distclean
before reconfiguring and
rebuilding. Without this, your changes in configuration choices
might not propagate everywhere they need to.
configure Options
configure's command line options are explained below.
This list is not exhaustive (use ./configure --help
to get one that is). The options not covered here are meant for
advanced use-cases such as cross-compilation, and are documented in
the standard Autoconf documentation.
These options control where make install
will put
the files. The --prefix
option is sufficient for
most cases. If you have special needs, you can customize the
installation subdirectories with the other options described in this
section. Beware however that changing the relative locations of the
different subdirectories may render the installation non-relocatable,
meaning you won't be able to move it after installation.
(The man
and doc
locations are
not affected by this restriction.) For relocatable installs, you
might want to use the --disable-rpath
option
described later.
--prefix=PREFIX
Install all files under the directory PREFIX
instead of /usr/local/pgsql
. The actual
files will be installed into various subdirectories; no files
will ever be installed directly into the
PREFIX
directory.
--exec-prefix=EXEC-PREFIX
You can install architecture-dependent files under a
different prefix, EXEC-PREFIX
, than what
PREFIX
was set to. This can be useful to
share architecture-independent files between hosts. If you
omit this, then EXEC-PREFIX
is set equal to
PREFIX
and both architecture-dependent and
independent files will be installed under the same tree,
which is probably what you want.
--bindir=DIRECTORY
Specifies the directory for executable programs. The default is EXEC-PREFIX/bin, which normally means /usr/local/pgsql/bin.
--sysconfdir=DIRECTORY
Sets the directory for various configuration files, PREFIX/etc by default.
--libdir=DIRECTORY
Sets the location to install libraries and dynamically loadable modules. The default is EXEC-PREFIX/lib.
--includedir=DIRECTORY
Sets the directory for installing C and C++ header files. The default is PREFIX/include.
--datarootdir=DIRECTORY
Sets the root directory for various types of read-only data files. This only sets the default for some of the following options. The default is PREFIX/share.
--datadir=DIRECTORY
Sets the directory for read-only data files used by the installed programs. The default is DATAROOTDIR. Note that this has nothing to do with where your database files will be placed.
--localedir=DIRECTORY
Sets the directory for installing locale data, in particular message translation catalog files. The default is DATAROOTDIR/locale.
--mandir=DIRECTORY
The man pages that come with PostgreSQL will be installed under this directory, in their respective manx subdirectories. The default is DATAROOTDIR/man.
--docdir=DIRECTORY
Sets the root directory for installing documentation files, except “man” pages. This only sets the default for the following options. The default value for this option is DATAROOTDIR/doc/postgresql.
--htmldir=DIRECTORY
The HTML-formatted documentation for PostgreSQL will be installed under this directory. The default is DATAROOTDIR.
Care has been taken to make it possible to install
PostgreSQL into shared installation locations
(such as /usr/local/include
) without
interfering with the namespace of the rest of the system. First,
the string “/postgresql
” is
automatically appended to datadir
,
sysconfdir
, and docdir
,
unless the fully expanded directory name already contains the
string “postgres
” or
“pgsql
”. For example, if you choose
/usr/local
as prefix, the documentation will
be installed in /usr/local/doc/postgresql
,
but if the prefix is /opt/postgres
, then it
will be in /opt/postgres/doc
. The public C
header files of the client interfaces are installed into
includedir
and are namespace-clean. The
internal header files and the server header files are installed
into private directories under includedir
. See
the documentation of each interface for information about how to
access its header files. Finally, a private subdirectory will
also be created, if appropriate, under libdir
for dynamically loadable modules.
The options described in this section enable building of various PostgreSQL features that are not built by default. Most of these are non-default only because they require additional software, as described in Section 17.2.
--enable-nls[=LANGUAGES]
Enables Native Language Support (NLS),
that is, the ability to display a program's messages in a
language other than English.
LANGUAGES
is an optional space-separated
list of codes of the languages that you want supported, for
example --enable-nls='de fr'
. (The intersection
between your list and the set of actually provided
translations will be computed automatically.) If you do not
specify a list, then all available translations are
installed.
To use this option, you will need an implementation of the Gettext API.
--with-perl
Build the PL/Perl server-side language.
--with-python
Build the PL/Python server-side language.
--with-tcl
Build the PL/Tcl server-side language.
--with-tclconfig=DIRECTORY
Tcl installs the file tclConfig.sh
, which
contains configuration information needed to build modules
interfacing to Tcl. This file is normally found automatically
at a well-known location, but if you want to use a different
version of Tcl you can specify the directory in which to look
for tclConfig.sh
.
--with-icu
Build with support for the ICU library, enabling use of ICU collation features (see Section 24.2). This requires the ICU4C package to be installed. The minimum required version of ICU4C is currently 4.2.
By default,
pkg-config
will be used to find the required compilation options. This is
supported for ICU4C version 4.6 and later.
For older versions, or if pkg-config is
not available, the variables ICU_CFLAGS
and ICU_LIBS
can be specified
to configure
, like in this example:
./configure ... --with-icu ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
(If ICU4C is in the default search path
for the compiler, then you still need to specify nonempty strings in
order to avoid use of pkg-config, for
example, ICU_CFLAGS=' '
.)
--with-llvm
Build with support for LLVM based JIT compilation (see Chapter 32). This requires the LLVM library to be installed. The minimum required version of LLVM is currently 3.9.
llvm-config will be used to find the required compilation options. llvm-config, and then
llvm-config-$major-$minor
for all supported
versions, will be searched for in your PATH
. If
that would not yield the desired program,
use LLVM_CONFIG
to specify a path to the
correct llvm-config
. For example
./configure ... --with-llvm LLVM_CONFIG='/path/to/llvm/bin/llvm-config'
LLVM support requires a compatible
clang
compiler (specified, if necessary, using the
CLANG
environment variable), and a working C++
compiler (specified, if necessary, using the CXX
environment variable).
--with-lz4
Build with LZ4 compression support. This allows the use of LZ4 for compression of table data.
--with-ssl=LIBRARY
Build with support for SSL (encrypted)
connections. The only LIBRARY
supported is openssl
. This requires the
OpenSSL package to be installed.
configure
will check for the required
header files and libraries to make sure that your
OpenSSL installation is sufficient
before proceeding.
--with-openssl
Obsolete equivalent of --with-ssl=openssl.
--with-gssapi
Build with support for GSSAPI authentication. On many systems, the
GSSAPI system (usually a part of the Kerberos installation) is not
installed in a location
that is searched by default (e.g., /usr/include
,
/usr/lib
), so you must use the options
--with-includes
and --with-libraries
in
addition to this option. configure
will check
for the required header files and libraries to make sure that
your GSSAPI installation is sufficient before proceeding.
--with-ldap
Build with LDAP
support for authentication and connection parameter lookup (see
Section 34.18 and
Section 21.10 for more information). On Unix,
this requires the OpenLDAP package to be
installed. On Windows, the default WinLDAP
library is used. configure
will check for the required
header files and libraries to make sure that your
OpenLDAP installation is sufficient before
proceeding.
--with-pam
Build with PAM (Pluggable Authentication Modules) support.
--with-bsd-auth
Build with BSD Authentication support. (The BSD Authentication framework is currently only available on OpenBSD.)
--with-systemd
Build with support for systemd service notifications. This improves integration if the server is started under systemd but has no impact otherwise; see Section 19.3 for more information. libsystemd and the associated header files need to be installed to use this option.
--with-bonjour
Build with support for Bonjour automatic service discovery. This requires Bonjour support in your operating system. Recommended on macOS.
--with-uuid=LIBRARY
Build the uuid-ossp module
(which provides functions to generate UUIDs), using the specified
UUID library.
LIBRARY
must be one of:
bsd
to use the UUID functions found in FreeBSD
and some other BSD-derived systems
e2fs
to use the UUID library created by
the e2fsprogs
project; this library is present in most
Linux systems and in macOS, and can be obtained for other
platforms as well
ossp
to use the OSSP UUID library
--with-ossp-uuid
Obsolete equivalent of --with-uuid=ossp.
--with-libxml
Build with libxml2, enabling SQL/XML support. Libxml2 version 2.6.23 or later is required for this feature.
To detect the required compiler and linker options, PostgreSQL will
query pkg-config
, if that is installed and knows
about libxml2. Otherwise the program xml2-config, which is installed by libxml2, will be used if it is found. Use
which is installed by libxml2, will be used if it is found. Use
of pkg-config
is preferred, because it can deal
with multi-architecture installations better.
To use a libxml2 installation that is in an unusual location, you
can set pkg-config
-related environment
variables (see its documentation), or set the environment variable
XML2_CONFIG
to point to
the xml2-config
program belonging to the libxml2
installation, or set the variables XML2_CFLAGS and XML2_LIBS. (If pkg-config is
installed, then to override its idea of where libxml2 is you must
either set XML2_CONFIG
or set
both XML2_CFLAGS
and XML2_LIBS
to
nonempty strings.)
--with-libxslt
Build with libxslt, enabling the
xml2
module to perform XSL transformations of XML.
--with-libxml
must be specified as well.
The options described in this section allow disabling certain PostgreSQL features that are built by default, but which might need to be turned off if the required software or system features are not available. Using these options is not recommended unless really necessary.
--without-readline
Prevents use of the Readline library (and libedit as well). This option disables command-line editing and history in psql.
--with-libedit-preferred
Favors the use of the BSD-licensed libedit library rather than GPL-licensed Readline. This option is significant only if you have both libraries installed; the default in that case is to use Readline.
--without-zlib
Prevents use of the Zlib library. This disables support for compressed archives in pg_dump and pg_restore.
--disable-spinlocks
Allow the build to succeed even if PostgreSQL has no CPU spinlock support for the platform. The lack of spinlock support will result in very poor performance; therefore, this option should only be used if the build aborts and informs you that the platform lacks spinlock support. If this option is required to build PostgreSQL on your platform, please report the problem to the PostgreSQL developers.
--disable-atomics
Disable use of CPU atomic operations. This option does nothing on platforms that lack such operations. On platforms that do have them, this will result in poor performance. This option is only useful for debugging or making performance comparisons.
--disable-thread-safety
Disable the thread-safety of client libraries. This prevents concurrent threads in libpq and ECPG programs from safely controlling their private connection handles. Use this only on platforms with deficient threading support.
--with-includes=DIRECTORIES
DIRECTORIES
is a colon-separated list of
directories that will be added to the list the compiler
searches for header files. If you have optional packages
(such as GNU Readline) installed in a non-standard
location,
you have to use this option and probably also the corresponding
--with-libraries
option.
Example: --with-includes=/opt/gnu/include:/usr/sup/include
.
--with-libraries=DIRECTORIES
DIRECTORIES
is a colon-separated list of
directories to search for libraries. You will probably have
to use this option (and the corresponding
--with-includes
option) if you have packages
installed in non-standard locations.
Example: --with-libraries=/opt/gnu/lib:/usr/sup/lib
.
--with-system-tzdata=DIRECTORY
PostgreSQL includes its own time zone database,
which it requires for date and time operations. This time zone
database is in fact compatible with the IANA time zone
database provided by many operating systems such as FreeBSD,
Linux, and Solaris, so it would be redundant to install it again.
When this option is used, the system-supplied time zone database
in DIRECTORY
is used instead of the one
included in the PostgreSQL source distribution.
DIRECTORY
must be specified as an
absolute path. /usr/share/zoneinfo
is a
likely directory on some operating systems. Note that the
installation routine will not detect mismatching or erroneous time
zone data. If you use this option, you are advised to run the
regression tests to verify that the time zone data you have
pointed to works correctly with PostgreSQL.
This option is mainly aimed at binary package distributors who know their target operating system well. The main advantage of using this option is that the PostgreSQL package won't need to be upgraded whenever any of the many local daylight-saving time rules change. Another advantage is that PostgreSQL can be cross-compiled more straightforwardly if the time zone database files do not need to be built during the installation.
--with-extra-version=STRING
Append STRING
to the PostgreSQL version number. You
can use this, for example, to mark binaries built from unreleased Git
snapshots or containing custom patches with an extra version string,
such as a git describe
identifier or a
distribution package release number.
--disable-rpath
Do not mark PostgreSQL's executables
to indicate that they should search for shared libraries in the
installation's library directory (see --libdir
).
On most platforms, this marking uses an absolute path to the
library directory, so that it will be unhelpful if you relocate
the installation later. However, you will then need to provide
some other way for the executables to find the shared libraries.
Typically this requires configuring the operating system's
dynamic linker to search the library directory; see
Section 17.5.1 for more detail.
It's fairly common, particularly for test builds, to adjust the
default port number with --with-pgport
.
The other options in this section are recommended only for advanced
users.
--with-pgport=NUMBER
Set NUMBER
as the default port number for
server and clients. The default is 5432. The port can always
be changed later on, but if you specify it here then both
server and clients will have the same default compiled in,
which can be very convenient. Usually the only good reason
to select a non-default value is if you intend to run multiple
PostgreSQL servers on the same machine.
--with-krb-srvnam=NAME
The default name of the Kerberos service principal used
by GSSAPI.
postgres
is the default. There's usually no
reason to change this unless you are building for a Windows
environment, in which case it must be set to upper case
POSTGRES
.
--with-segsize=SEGSIZE
Set the segment size, in gigabytes. Large tables are
divided into multiple operating-system files, each of size equal
to the segment size. This avoids problems with file size limits
that exist on many platforms. The default segment size, 1 gigabyte,
is safe on all supported platforms. If your operating system has
“largefile” support (which most do, nowadays), you can use
a larger segment size. This can be helpful to reduce the number of
file descriptors consumed when working with very large tables.
But be careful not to select a value larger than is supported
by your platform and the file systems you intend to use. Other
tools you might wish to use, such as tar, could
also set limits on the usable file size.
It is recommended, though not absolutely required, that this value
be a power of 2.
Note that changing this value breaks on-disk database compatibility,
meaning you cannot use pg_upgrade
to upgrade to
a build with a different segment size.
--with-blocksize=BLOCKSIZE
Set the block size, in kilobytes. This is the unit
of storage and I/O within tables. The default, 8 kilobytes,
is suitable for most situations; but other values may be useful
in special cases.
The value must be a power of 2 between 1 and 32 (kilobytes).
Note that changing this value breaks on-disk database compatibility,
meaning you cannot use pg_upgrade
to upgrade to
a build with a different block size.
--with-wal-blocksize=BLOCKSIZE
Set the WAL block size, in kilobytes. This is the unit
of storage and I/O within the WAL log. The default, 8 kilobytes,
is suitable for most situations; but other values may be useful
in special cases.
The value must be a power of 2 between 1 and 64 (kilobytes).
Note that changing this value breaks on-disk database compatibility,
meaning you cannot use pg_upgrade
to upgrade to
a build with a different WAL block size.
Most of the options in this section are only of interest for
developing or debugging PostgreSQL.
They are not recommended for production builds, except
for --enable-debug
, which can be useful to enable
detailed bug reports in the unlucky event that you encounter a bug.
On platforms supporting DTrace, --enable-dtrace
may also be reasonable to use in production.
When building an installation that will be used to develop code inside
the server, it is recommended to use at least the
options --enable-debug
and --enable-cassert
.
--enable-debug
Compiles all programs and libraries with debugging symbols. This means that you can run the programs in a debugger to analyze problems. This enlarges the size of the installed executables considerably, and on non-GCC compilers it usually also disables compiler optimization, causing slowdowns. However, having the symbols available is extremely helpful for dealing with any problems that might arise. Currently, this option is recommended for production installations only if you use GCC. But you should always have it on if you are doing development work or running a beta version.
--enable-cassert
Enables assertion checks in the server, which test for many “cannot happen” conditions. This is invaluable for code development purposes, but the tests can slow down the server significantly. Also, having the tests turned on won't necessarily enhance the stability of your server! The assertion checks are not categorized for severity, and so what might be a relatively harmless bug will still lead to server restarts if it triggers an assertion failure. This option is not recommended for production use, but you should have it on for development work or when running a beta version.
--enable-tap-tests
Enable tests using the Perl TAP tools. This requires a Perl
installation and the Perl module IPC::Run.
See Section 33.4 for more information.
--enable-depend
Enables automatic dependency tracking. With this option, the makefiles are set up so that all affected object files will be rebuilt when any header file is changed. This is useful if you are doing development work, but is just wasted overhead if you intend only to compile once and install. At present, this option only works with GCC.
--enable-coverage
If using GCC, all programs and libraries are compiled with code coverage testing instrumentation. When run, they generate files in the build directory with code coverage metrics. See Section 33.5 for more information. This option is for use only with GCC and when doing development work.
--enable-profiling
If using GCC, all programs and libraries are compiled so they
can be profiled. On backend exit, a subdirectory will be created
that contains the gmon.out
file containing
profile data.
This option is for use only with GCC and when doing development work.
--enable-dtrace
Compiles PostgreSQL with support for the dynamic tracing tool DTrace. See Section 28.5 for more information.
To point to the dtrace program, the
environment variable DTRACE can be set. This
will often be necessary because dtrace is
typically installed under /usr/sbin,
which might not be in your PATH.
Extra command-line options for the dtrace program
can be specified in the environment variable
DTRACEFLAGS. On Solaris,
to include DTrace support in a 64-bit binary, you must specify
DTRACEFLAGS="-64". For example,
using the GCC compiler:
./configure CC='gcc -m64' --enable-dtrace DTRACEFLAGS='-64' ...
Using Sun's compiler:
./configure CC='/opt/SUNWspro/bin/cc -xtarget=native64' --enable-dtrace DTRACEFLAGS='-64' ...
configure Environment Variables
In addition to the ordinary command-line options described above,
configure responds to a number of environment variables.
You can specify environment variables on the
configure command line, for example:
./configure CC=/opt/bin/gcc CFLAGS='-O2 -pipe'
In this usage an environment variable is little different from a command-line option. You can also set such variables beforehand:
export CC=/opt/bin/gcc
export CFLAGS='-O2 -pipe'
./configure
This usage can be convenient because many programs' configuration scripts respond to these variables in similar ways.
The most commonly used of these environment variables are
CC and CFLAGS.
If you prefer a C compiler different from the one
configure picks, you can set the
variable CC to the program of your choice.
By default, configure will pick
gcc if available, else the platform's
default (usually cc). Similarly, you can override the
default compiler flags if needed with the CFLAGS variable.
Here is a list of the significant variables that can be set in this manner:
BISON
Bison program
CC
C compiler
CFLAGS
options to pass to the C compiler
CLANG
path to the clang program used to process source code
for inlining when compiling with --with-llvm
CPP
C preprocessor
CPPFLAGS
options to pass to the C preprocessor
CXX
C++ compiler
CXXFLAGS
options to pass to the C++ compiler
DTRACE
location of the dtrace program
DTRACEFLAGS
options to pass to the dtrace program
FLEX
Flex program
LDFLAGS
options to use when linking either executables or shared libraries
LDFLAGS_EX
additional options for linking executables only
LDFLAGS_SL
additional options for linking shared libraries only
LLVM_CONFIG
llvm-config program used to locate the LLVM installation
MSGFMT
msgfmt program for native language support
PERL
Perl interpreter program. This will be used to determine the
dependencies for building PL/Perl. The default is perl.
PYTHON
Python interpreter program. This will be used to
determine the dependencies for building PL/Python. Also,
whether Python 2 or 3 is specified here (or otherwise
implicitly chosen) determines which variant of the PL/Python
language becomes available. See Section 46.1
for more information. If this is not set, the following are probed
in this order: python python3 python2.
TCLSH
Tcl interpreter program. This will be used to
determine the dependencies for building PL/Tcl.
If this is not set, the following are probed in this
order: tclsh tcl tclsh8.6 tclsh86 tclsh8.5 tclsh85 tclsh8.4 tclsh84.
XML2_CONFIG
xml2-config program used to locate the libxml2 installation
Sometimes it is useful to add compiler flags after-the-fact to the set
that was chosen by configure. An important example is
that gcc's -Werror option cannot be included
in the CFLAGS passed to configure, because
it will break many of configure's built-in tests. To add
such flags, include them in the COPT environment variable
while running make. The contents of COPT
are added to both the CFLAGS and LDFLAGS
options set up by configure. For example, you could do
make COPT='-Werror'
or
export COPT='-Werror'
make
If using GCC, it is best to build with an optimization level of
at least -O1, because using no optimization
(-O0) disables some important compiler warnings (such
as the use of uninitialized variables). However, non-zero
optimization levels can complicate debugging because stepping
through compiled code will usually not match up one-to-one with
source code lines. If you get confused while trying to debug
optimized code, recompile the specific files of interest with
-O0. An easy way to do this is by passing an option
to make: make PROFILE=-O0 file.o.
The COPT
and PROFILE
environment variables are
actually handled identically by the PostgreSQL
makefiles. Which to use is a matter of preference, but a common habit
among developers is to use PROFILE
for one-time flag
adjustments, while COPT
might be kept set all the time.
On some systems with shared libraries you need to tell the system how to find the newly installed shared libraries. The systems on which this is not necessary include FreeBSD, HP-UX, Linux, NetBSD, OpenBSD, and Solaris.
The method to set the shared library search path varies between
platforms, but the most widely-used method is to set the
environment variable LD_LIBRARY_PATH like so: In Bourne
shells (sh, ksh, bash, zsh):
LD_LIBRARY_PATH=/usr/local/pgsql/lib
export LD_LIBRARY_PATH
or in csh or tcsh:
setenv LD_LIBRARY_PATH /usr/local/pgsql/lib
Replace /usr/local/pgsql/lib with whatever you set
--libdir to in Step 1.
You should put these commands into a shell start-up file such as
/etc/profile or ~/.bash_profile. Some
good information about the caveats associated with this method can
be found at http://xahlee.info/UnixResource_dir/_/ldpath.html.
On some systems it might be preferable to set the environment
variable LD_RUN_PATH
before
building.
On Cygwin, put the library
directory in the PATH
or move the
.dll
files into the bin
directory.
If in doubt, refer to the manual pages of your system (perhaps
ld.so
or rld
). If you later
get a message like:
psql: error in loading shared libraries libpq.so.2.1: cannot open shared object file: No such file or directory
then this step was necessary. Simply take care of it then.
If you are on Linux and you have root access, you can run:
/sbin/ldconfig /usr/local/pgsql/lib
(or equivalent directory) after installation to enable the
run-time linker to find the shared libraries faster. Refer to the
manual page of ldconfig
for more information. On
FreeBSD, NetBSD, and OpenBSD the command is:
/sbin/ldconfig -m /usr/local/pgsql/lib
instead. Other systems are not known to have an equivalent command.
If you installed into /usr/local/pgsql or some other
location that is not searched for programs by default, you should
add /usr/local/pgsql/bin (or whatever you set
--bindir to in Step 1)
into your PATH. Strictly speaking, this is not
necessary, but it will make the use of PostgreSQL
much more convenient.
To do this, add the following to your shell start-up file, such as
~/.bash_profile (or /etc/profile, if you
want it to affect all users):
PATH=/usr/local/pgsql/bin:$PATH
export PATH
If you are using csh
or tcsh
, then use this command:
set path = ( /usr/local/pgsql/bin $path )
To enable your system to find the man documentation, you need to add lines like the following to a shell start-up file unless you installed into a location that is searched by default:
MANPATH=/usr/local/pgsql/share/man:$MANPATH
export MANPATH
The environment variables PGHOST and PGPORT
specify to client applications the host and port of the database
server, overriding the compiled-in defaults. If you are going to
run client applications remotely then it is convenient if every
user that plans to use the database sets PGHOST. This
is not required, however; the settings can be communicated via command
line options to most client programs.
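For example, a user could put something like the following in a shell start-up file (the host name and port shown are placeholders):
export PGHOST=db.example.com
export PGPORT=5433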
A platform (that is, a CPU architecture and operating system combination) is considered supported by the PostgreSQL development community if the code contains provisions to work on that platform and it has recently been verified to build and pass its regression tests on that platform. Currently, most testing of platform compatibility is done automatically by test machines in the PostgreSQL Build Farm. If you are interested in using PostgreSQL on a platform that is not represented in the build farm, but on which the code works or can be made to work, you are strongly encouraged to set up a build farm member machine so that continued compatibility can be assured.
In general, PostgreSQL can be expected to work on
these CPU architectures: x86, x86_64, IA64, PowerPC,
PowerPC 64, S/390, S/390x, Sparc, Sparc 64, ARM, MIPS, MIPSEL,
and PA-RISC. Code support exists for M68K, M32R, and VAX, but these
architectures are not known to have been tested recently. It is often
possible to build on an unsupported CPU type by configuring with
--disable-spinlocks, but performance will be poor.
PostgreSQL can be expected to work on these operating systems: Linux (all recent distributions), Windows (XP and later), FreeBSD, OpenBSD, NetBSD, macOS, AIX, HP/UX, and Solaris. Other Unix-like systems may also work but are not currently being tested. In most cases, all CPU architectures supported by a given operating system will work. Look in Section 17.7 below to see if there is information specific to your operating system, particularly if using an older system.
If you have installation problems on a platform that is known
to be supported according to recent build farm results, please report
it to <pgsql-bugs@lists.postgresql.org>. If you are interested
in porting PostgreSQL to a new platform,
<pgsql-hackers@lists.postgresql.org> is the appropriate place
to discuss that.
This section documents additional platform-specific issues regarding the installation and setup of PostgreSQL. Be sure to read the installation instructions, and in particular Section 17.2 as well. Also, check Chapter 33 regarding the interpretation of regression test results.
Platforms that are not covered here have no known platform-specific installation issues.
PostgreSQL works on AIX, but AIX versions before about 6.1 have
various issues and are not recommended.
You can use GCC or the native IBM compiler xlc.
AIX can be somewhat peculiar with regards to the way it does memory management. You can have a server with many multiples of gigabytes of RAM free, but still get out of memory or address space errors when running applications. One example is loading of extensions failing with unusual errors. For example, running as the owner of the PostgreSQL installation:
=# CREATE EXTENSION plperl;
ERROR: could not load library "/opt/dbs/pgsql/lib/plperl.so": A memory address is not in the address space for the process.
Running as a non-owner in the group possessing the PostgreSQL installation:
=# CREATE EXTENSION plperl;
ERROR: could not load library "/opt/dbs/pgsql/lib/plperl.so": Bad address
Another example is out of memory errors in the PostgreSQL server logs, with every memory allocation near or greater than 256 MB failing.
The overall cause of all these problems is the default bittedness and memory model used by the server process. By default, all binaries built on AIX are 32-bit. This does not depend upon hardware type or kernel in use. These 32-bit processes are limited to 4 GB of memory laid out in 256 MB segments using one of a few models. The default allows for less than 256 MB in the heap as it shares a single segment with the stack.
In the case of the plperl
example, above,
check your umask and the permissions of the binaries in your
PostgreSQL installation. The binaries involved in that example
were 32-bit and installed as mode 750 instead of 755. Due to the
permissions being set in this fashion, only the owner or a member
of the possessing group can load the library. Since it isn't
world-readable, the loader places the object into the process'
heap instead of the shared library segments where it would
otherwise be placed.
The “ideal” solution for this is to use a 64-bit build of PostgreSQL, but that is not always practical, because systems with 32-bit processors can build, but not run, 64-bit binaries.
If a 32-bit binary is desired, set LDR_CNTRL to
MAXDATA=0xn0000000,
where 1 <= n <= 8, before starting the PostgreSQL server,
and try different values and postgresql.conf
settings to find a configuration that works satisfactorily. This
use of LDR_CNTRL tells AIX that you want the
server to have MAXDATA bytes set aside for the
heap, allocated in 256 MB segments. When you find a workable
configuration, ldedit can be used to modify the binaries so
that they default to using the desired heap size. PostgreSQL can
also be rebuilt, passing configure
LDFLAGS="-Wl,-bmaxdata:0xn0000000"
to achieve the same effect.
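For example, a sketch of starting the server with a heap of four 256 MB segments (the value of n and the data directory path are illustrative only):
export LDR_CNTRL=MAXDATA=0x40000000
pg_ctl start -D /usr/local/pgsql/data -l logfile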
For a 64-bit build, set OBJECT_MODE
to 64 and
pass CC="gcc -maix64"
and LDFLAGS="-Wl,-bbigtoc"
to configure
. (Options for
xlc
might differ.) If you omit the export of
OBJECT_MODE
, your build may fail with linker errors. When
OBJECT_MODE
is set, it tells AIX's build utilities
such as ar
, as
, and ld
what
type of objects to default to handling.
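Putting the pieces above together, a 64-bit GCC build might be configured roughly as follows (the installation prefix is an arbitrary example):
export OBJECT_MODE=64
./configure CC="gcc -maix64" LDFLAGS="-Wl,-bbigtoc" --prefix=/usr/local/pgsql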
By default, overcommit of paging space can happen. While we have not seen this occur, AIX will kill processes when it runs out of memory and the overcommit is accessed. The closest to this that we have seen is fork failing because the system decided that there was not enough memory for another process. Like many other parts of AIX, the paging space allocation method and out-of-memory kill is configurable on a system- or process-wide basis if this becomes a problem.
PostgreSQL can be built using Cygwin, a Linux-like environment for Windows, but that method is inferior to the native Windows build (see Chapter 18) and running a server under Cygwin is no longer recommended.
When building from source, proceed according to the Unix-style
installation procedure (i.e., ./configure;
make; etc.), noting the following Cygwin-specific
differences:
Set your path to use the Cygwin bin directory before the Windows utilities. This will help prevent problems with compilation.
The adduser
command is not supported; use
the appropriate user management application on Windows NT,
2000, or XP. Otherwise, skip this step.
The su
command is not supported; use ssh to
simulate su on Windows NT, 2000, or XP. Otherwise, skip this
step.
OpenSSL is not supported.
Start cygserver for shared memory support.
To do this, enter the command /usr/sbin/cygserver &.
This program needs to be running anytime you
start the PostgreSQL server or initialize a database cluster
(initdb). The
default cygserver configuration may need to
be changed (e.g., increase SEMMNS) to prevent
PostgreSQL from failing due to a lack of system resources.
Building might fail on some systems where a locale other than
C is in use. To fix this, set the locale to C by doing
export LANG=C.utf8
before building, and then
setting it back to the previous setting after you have installed
PostgreSQL.
The parallel regression tests (make check
)
can generate spurious regression test failures due to
overflowing the listen()
backlog queue
which causes connection refused errors or hangs. You can limit
the number of connections using the make
variable MAX_CONNECTIONS
thus:
make MAX_CONNECTIONS=5 check
(On some systems you can have up to about 10 simultaneous connections.)
It is possible to install cygserver
and the
PostgreSQL server as Windows NT services. For information on how
to do this, please refer to the README
document included with the PostgreSQL binary package on Cygwin.
It is installed in the
directory /usr/share/doc/Cygwin
.
To build PostgreSQL from source on macOS, you will need to install Apple's command line developer tools, which can be done by issuing
xcode-select --install
(note that this will pop up a GUI dialog window for confirmation). You may or may not wish to also install Xcode.
On recent macOS releases, it's necessary to
embed the “sysroot” path in the include switches used to
find some system header files. This results in the outputs of
the configure script varying depending on
which SDK version was used during configure.
That shouldn't pose any problem in simple scenarios, but if you are
trying to do something like building an extension on a different machine
than the server code was built on, you may need to force use of a
different sysroot path. To do that, set PG_SYSROOT,
for example
make PG_SYSROOT=/desired/path all
To find out the appropriate path on your machine, run
xcrun --show-sdk-path
Note that building an extension using a different sysroot version than was used to build the core server is not really recommended; in the worst case it could result in hard-to-debug ABI inconsistencies.
You can also select a non-default sysroot path when configuring, by
specifying PG_SYSROOT
to configure:
./configure ... PG_SYSROOT=/desired/path
This would primarily be useful to cross-compile for some other macOS version. There is no guarantee that the resulting executables will run on the current host.
To suppress the -isysroot
options altogether, use
./configure ... PG_SYSROOT=none
(any nonexistent pathname will work). This might be useful if you wish to build with a non-Apple compiler, but beware that that case is not tested or supported by the PostgreSQL developers.
macOS's “System Integrity Protection” (SIP) feature breaks make check,
because it prevents passing the needed setting
of DYLD_LIBRARY_PATH down to the executables being
tested. You can work around that by doing make install
before make check.
Most PostgreSQL developers just turn off SIP, though.
PostgreSQL for Windows can be built using MinGW, a Unix-like build environment for Microsoft operating systems, or using Microsoft's Visual C++ compiler suite. The MinGW build procedure uses the normal build system described in this chapter; the Visual C++ build works completely differently and is described in Chapter 18.
The native Windows port requires a 32 or 64-bit version of Windows
2000 or later. Earlier operating systems do
not have sufficient infrastructure (but Cygwin may be used on
those). MinGW, the Unix-like build tools, and MSYS, a collection
of Unix tools required to run shell scripts
like configure
, can be downloaded
from http://www.mingw.org/. Neither is
required to run the resulting binaries; they are needed only for
creating the binaries.
To build 64 bit binaries using MinGW, install the 64 bit tool set
from https://mingw-w64.org/, put its bin
directory in the PATH
, and run
configure
with the
--host=x86_64-w64-mingw32
option.
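For example, assuming the 64 bit tool set was installed under C:\mingw64 (an illustrative path), the steps from an MSYS shell might look like:
export PATH=/c/mingw64/bin:$PATH
./configure --host=x86_64-w64-mingw32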
After you have everything installed, it is suggested that you
run psql
under CMD.EXE
, as the MSYS console has
buffering issues.
If PostgreSQL on Windows crashes, it has the ability to generate
minidumps that can be used to track down the cause
for the crash, similar to core dumps on Unix. These dumps can be
read using the Windows Debugger Tools or using
Visual Studio. To enable the generation of dumps
on Windows, create a subdirectory named crashdumps
inside the cluster data directory. The dumps will then be written
into this directory with a unique name based on the identifier of
the crashing process and the current time of the crash.
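For example, if the data directory is C:\pgdata (an illustrative path), creating the directory once is enough to enable minidump collection:
mkdir C:\pgdata\crashdumps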
PostgreSQL is well-supported on Solaris. The more up to date your operating system, the fewer issues you will experience.
You can build with either GCC or Sun's compiler suite. For
better code optimization, Sun's compiler is strongly recommended
on the SPARC architecture. If
you are using Sun's compiler, be careful not to select
/usr/ucb/cc; use /opt/SUNWspro/bin/cc.
You can download Sun Studio from https://www.oracle.com/technetwork/server-storage/solarisstudio/downloads/. Many GNU tools are integrated into Solaris 10, or they are present on the Solaris companion CD. If you need packages for older versions of Solaris, you can find these tools at http://www.sunfreeware.com. If you prefer sources, look at https://www.gnu.org/prep/ftp.
If configure
complains about a failed test
program, this is probably a case of the run-time linker being
unable to find some library, probably libz, libreadline or some
other non-standard library such as libssl. To point it to the
right location, set the LDFLAGS
environment
variable on the configure
command line, e.g.,
configure ... LDFLAGS="-R /usr/sfw/lib:/opt/sfw/lib:/usr/local/lib"
See the ld man page for more information.
On the SPARC architecture, Sun Studio is strongly recommended for
compilation. Try using the -xO5
optimization
flag to generate significantly faster binaries. Do not use any
flags that modify behavior of floating-point operations
and errno
processing (e.g.,
-fast
).
If you do not have a reason to use 64-bit binaries on SPARC, prefer the 32-bit version. The 64-bit operations are slower and 64-bit binaries are slower than the 32-bit variants. On the other hand, 32-bit code on the AMD64 CPU family is not native, so 32-bit code is significantly slower on that CPU family.
Yes, using DTrace is possible. See Section 28.5 for further information.
If you see the linking of the postgres
executable abort with an
error message like:
Undefined                       first referenced
 symbol                             in file
AbortTransaction                    utils/probes.o
CommitTransaction                   utils/probes.o
ld: fatal: Symbol referencing errors. No output written to postgres
collect2: ld returned 1 exit status
make: *** [postgres] Error 1
your DTrace installation is too old to handle probes in static functions. You need Solaris 10u4 or newer to use DTrace.
It is recommended that most users download the binary distribution for Windows, available as a graphical installer package from the PostgreSQL website at https://www.postgresql.org/download/. Building from source is only intended for people developing PostgreSQL or extensions.
There are several different ways of building PostgreSQL on Windows. The simplest way to build with Microsoft tools is to install Visual Studio 2022 and use the included compiler. It is also possible to build with the full Microsoft Visual C++ 2013 to 2022. In some cases that requires the installation of the Windows SDK in addition to the compiler.
It is also possible to build PostgreSQL using the GNU compiler tools provided by MinGW, or using Cygwin for older versions of Windows.
Building using MinGW or Cygwin uses the normal build system, see Chapter 17 and the specific notes in Section 17.7.4 and Section 17.7.2. To produce native 64 bit binaries in these environments, use the tools from MinGW-w64. These tools can also be used to cross-compile for 32 bit and 64 bit Windows targets on other hosts, such as Linux and macOS. Cygwin is not recommended for running a production server, and it should only be used for running on older versions of Windows where the native build does not work. The official binaries are built using Visual Studio.
Native builds of psql don't support command line editing. The Cygwin build does support command line editing, so it should be used where psql is needed for interactive use on Windows.
PostgreSQL can be built using the Visual C++ compiler suite from Microsoft. These compilers can be either from Visual Studio, Visual Studio Express or some versions of the Microsoft Windows SDK. If you do not already have a Visual Studio environment set up, the easiest ways are to use the compilers from Visual Studio 2022 or those in the Windows SDK 10, which are both free downloads from Microsoft.
Both 32-bit and 64-bit builds are possible with the Microsoft Compiler suite. 32-bit PostgreSQL builds are possible with Visual Studio 2013 to Visual Studio 2022, as well as standalone Windows SDK releases 8.1a to 10. 64-bit PostgreSQL builds are supported with Microsoft Windows SDK version 8.1a to 10 or Visual Studio 2013 and above. Compilation is supported down to Windows 7 and Windows Server 2008 R2 SP1 when building with Visual Studio 2013 to Visual Studio 2022.
The tools for building using Visual C++ or
Platform SDK are in the
src\tools\msvc
directory. When building, make sure
there are no tools from MinGW or
Cygwin present in your system PATH. Also, make
sure you have all the required Visual C++ tools available in the PATH. In
Visual Studio, start the
Visual Studio Command Prompt.
If you wish to build a 64-bit version, you must use the 64-bit version of
the command, and vice versa.
Starting with Visual Studio 2017 this can be
done from the command line using VsDevCmd.bat; see
-help for the available options and their default values.
vsvars32.bat is available in
Visual Studio 2015 and earlier versions for the same purpose.
From the Visual Studio Command Prompt, you can
change the targeted CPU architecture, build type, and target OS by using the
vcvarsall.bat
command, e.g.,
vcvarsall.bat x64 10.0.10240.0
to target Windows 10
with a 64-bit release build. See -help
for the other
options of vcvarsall.bat
. All commands should be run from
the src\tools\msvc
directory.
Before you build, you can create the file config.pl
to reflect any configuration options you want to change, or the paths to
any third party libraries to use. The complete configuration is determined
by first reading and parsing the file config_default.pl,
and then applying any changes from config.pl. For example,
to specify the location of your Python installation,
put the following in config.pl:
$config->{python} = 'c:\python26';
You only need to specify those parameters that are different from what's in
config_default.pl
.
If you need to set any other environment variables, create a file called
buildenv.pl
and put the required commands there. For
example, to add the path for bison when it's not in the PATH, create a file
containing:
$ENV{PATH}=$ENV{PATH} . ';c:\some\where\bison\bin';
To pass additional command line arguments to the Visual Studio build command (msbuild or vcbuild):
$ENV{MSBFLAGS}="/m";
The following additional products are required to build
PostgreSQL. Use the
config.pl
file to specify which directories the libraries
are available in.
If your build environment doesn't ship with a supported version of the Microsoft Windows SDK it is recommended that you upgrade to the latest version (currently version 10), available for download from https://www.microsoft.com/download.
You must always include the Windows Headers and Libraries part of the SDK. If you install a Windows SDK including the Visual C++ Compilers, you don't need Visual Studio to build. Note that as of Version 8.0a the Windows SDK no longer ships with a complete command-line build environment.
ActiveState Perl is required to run the build generation scripts. MinGW or Cygwin Perl will not work. It must also be present in the PATH. Binaries can be downloaded from https://www.activestate.com (Note: version 5.8.3 or later is required, the free Standard Distribution is sufficient).
The following additional products are not required to get started,
but are required to build the complete package. Use the
config.pl
file to specify which directories the libraries
are available in.
Required for building PL/Tcl (Note: version 8.4 is required, the free Standard Distribution is sufficient).
Bison and Flex are required to build from Git, but not required when building from a release file. Only Bison 1.875 or versions 2.2 and later will work. Flex must be version 2.5.31 or later.
Both Bison and Flex are included in the msys tool suite, available from http://www.mingw.org/wiki/MSYS as part of the MinGW compiler suite.
You will need to add the directory containing
flex.exe
and bison.exe
to the
PATH environment variable in buildenv.pl
unless
they are already in PATH. In the case of MinGW, the directory is the
\msys\1.0\bin
subdirectory of your MinGW
installation directory.
The Bison distribution from GnuWin32 appears to have a bug that
causes Bison to malfunction when installed in a directory with
spaces in the name, such as the default location on English
installations C:\Program Files\GnuWin32
.
Consider installing into C:\GnuWin32
or use the
NTFS short name path to GnuWin32 in your PATH environment setting
(e.g., C:\PROGRA~1\GnuWin32
).
Diff is required to run the regression tests, and can be downloaded from http://gnuwin32.sourceforge.net.
Gettext is required to build with NLS support, and can be downloaded from http://gnuwin32.sourceforge.net. Note that binaries, dependencies and developer files are all needed.
Required for GSSAPI authentication support. MIT Kerberos can be downloaded from https://web.mit.edu/Kerberos/dist/index.html.
Required for XML support. Binaries can be downloaded from https://zlatkovic.com/pub/libxml or source from http://xmlsoft.org. Note that libxml2 requires iconv, which is available from the same download location.
Required for supporting LZ4 compression method for compressing the table data. Binaries and source can be downloaded from https://github.com/lz4/lz4/releases.
Required for SSL support. Binaries can be downloaded from https://slproweb.com/products/Win32OpenSSL.html or source from https://www.openssl.org.
Required for UUID-OSSP support (contrib only). Source can be downloaded from http://www.ossp.org/pkg/lib/uuid/.
Required for building PL/Python. Binaries can be downloaded from https://www.python.org.
Required for compression support in pg_dump and pg_restore. Binaries can be downloaded from https://www.zlib.net.
PostgreSQL will only build for the x64 architecture on 64-bit Windows; there is no support for Itanium processors.
Mixing 32- and 64-bit versions in the same build tree is not supported. The build system will automatically detect if it's running in a 32- or 64-bit environment, and build PostgreSQL accordingly. For this reason, it is important to start the correct command prompt before building.
To use a server-side third party library such as python or OpenSSL, this library must also be 64-bit. There is no support for loading a 32-bit library in a 64-bit server. Several of the third party libraries that PostgreSQL supports may only be available in 32-bit versions, in which case they cannot be used with 64-bit PostgreSQL.
To build all of PostgreSQL in release configuration (the default), run the command:
build
To build all of PostgreSQL in debug configuration, run the command:
build DEBUG
To build just a single project, for example psql, run the commands:
build psql
build DEBUG psql
To change the default build configuration to debug, put the following
in the buildenv.pl
file:
$ENV{CONFIG}="Debug";
It is also possible to build from inside the Visual Studio GUI. In this case, you need to run:
perl mkvcbuild.pl
from the command prompt, and then open the generated
pgsql.sln
(in the root directory of the source tree)
in Visual Studio.
Most of the time, the automatic dependency tracking in Visual Studio will
handle changed files. But if there have been large changes, you may need
to clean the installation. To do this, simply run the
clean.bat
command, which will automatically clean out
all generated files. You can also run it with the
dist
parameter, in which case it will behave like
make distclean
and remove the flex/bison output files
as well.
By default, all files are written into a subdirectory of the
debug
or release
directories. To
install these files using the standard layout, and also generate the files
required to initialize and use the database, run the command:
install c:\destination\directory
If you want to install only the client applications and interface libraries, then you can use these commands:
install c:\destination\directory client
To run the regression tests, make sure you have completed the build of all
required parts first. Also, make sure that the DLLs required to load all
parts of the system (such as the Perl and Python DLLs for the procedural
languages) are present in the system path. If they are not, set it through
the buildenv.pl
file. To run the tests, run one of
the following commands from the src\tools\msvc
directory:
vcregress check
vcregress installcheck
vcregress plcheck
vcregress contribcheck
vcregress modulescheck
vcregress ecpgcheck
vcregress isolationcheck
vcregress bincheck
vcregress recoverycheck
vcregress taptest
vcregress upgradecheck
To change the schedule used (default is parallel), append it to the command line like:
vcregress check serial
vcregress taptest
can be used to run the TAP tests
of a target directory, like:
vcregress taptest src\bin\initdb\
For more information about the regression tests, see Chapter 33.
Running the regression tests on client programs with
vcregress bincheck
, on recovery tests with
vcregress recoverycheck
, or TAP tests specified with
vcregress taptest
requires an additional Perl module
to be installed:
As of this writing, IPC::Run
is not included in the
ActiveState Perl installation, nor in the ActiveState Perl Package
Manager (PPM) library. To install, download the
IPC-Run-<version>.tar.gz
source archive from CPAN,
at https://metacpan.org/release/IPC-Run, and
uncompress. Edit the buildenv.pl
file, and add a PERL5LIB
variable to point to the lib
subdirectory from the
extracted archive. For example:
$ENV{PERL5LIB}=$ENV{PERL5LIB} . ';c:\IPC-Run-0.94\lib';
The TAP tests run with vcregress support the
environment variables PROVE_TESTS, which is expanded
automatically using the name patterns given, and
PROVE_FLAGS. These can be set on a Windows terminal,
before running vcregress:
set PROVE_FLAGS=--timer --jobs 2
set PROVE_TESTS=t/020*.pl t/010*.pl
It is also possible to set up those parameters in
buildenv.pl:
$ENV{PROVE_FLAGS}='--timer --jobs 2'
$ENV{PROVE_TESTS}='t/020*.pl t/010*.pl'
Some of the TAP tests depend on a set of external commands that would
optionally trigger tests related to them. Each one of those variables
can be set or unset in buildenv.pl:
GZIP_PROGRAM
Path to a gzip command. The default is gzip, which would be the command found in PATH.
LZ4
Path to a lz4 command. The default is lz4, which would be the command found in PATH.
TAR
Path to a tar command. The default is tar, which would be the command found in PATH.
This chapter discusses how to set up and run the database server, and its interactions with the operating system.
The directions in this chapter assume that you are working with plain PostgreSQL without any additional infrastructure, for example a copy that you built from source according to the directions in the preceding chapters. If you are working with a pre-packaged or vendor-supplied version of PostgreSQL, it is likely that the packager has made special provisions for installing and starting the database server according to your system's conventions. Consult the package-level documentation for details.
As with any server daemon that is accessible to the outside world,
it is advisable to run PostgreSQL under a
separate user account. This user account should only own the data
that is managed by the server, and should not be shared with other
daemons. (For example, using the user nobody
is a bad
idea.) In particular, it is advisable that this user account not own
the PostgreSQL executable files, to ensure
that a compromised server process could not modify those executables.
Pre-packaged versions of PostgreSQL will typically create a suitable user account automatically during package installation.
To add a Unix user account to your system, look for a command
useradd
or adduser
. The user
name postgres is often used, and is assumed
throughout this book, but you can use another name if you like.
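For example, on many Linux systems something like the following, run as root, creates such an account (the exact command and its options vary by platform):
useradd -m postgres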
Before you can do anything, you must initialize a database storage
area on disk. We call this a database cluster.
(The SQL standard uses the term catalog cluster.) A
database cluster is a collection of databases that is managed by a
single instance of a running database server. After initialization, a
database cluster will contain a database named postgres
,
which is meant as a default database for use by utilities, users and third
party applications. The database server itself does not require the
postgres
database to exist, but many external utility
programs assume it exists. Another database created within each cluster
during initialization is called
template1
. As the name suggests, this will be used
as a template for subsequently created databases; it should not be
used for actual work. (See Chapter 23 for
information about creating new databases within a cluster.)
In file system terms, a database cluster is a single directory
under which all data will be stored. We call this the data
directory or data area. It is
completely up to you where you choose to store your data. There is no
default, although locations such as
/usr/local/pgsql/data
or
/var/lib/pgsql/data
are popular.
The data directory must be initialized before being used, using the program
initdb
which is installed with PostgreSQL.
If you are using a pre-packaged version
of PostgreSQL, it may well have a specific
convention for where to place the data directory, and it may also
provide a script for creating the data directory. In that case you
should use that script in preference to
running initdb
directly.
Consult the package-level documentation for details.
To initialize a database cluster manually,
run initdb
and specify the desired
file system location of the database cluster with the
-D
option, for example:
$ initdb -D /usr/local/pgsql/data
Note that you must execute this command while logged into the PostgreSQL user account, which is described in the previous section.
Alternatively, you can run initdb
via
the pg_ctl
program like so:
$ pg_ctl -D /usr/local/pgsql/data initdb
This may be more intuitive if you are
using pg_ctl
for starting and stopping the
server (see Section 19.3), so
that pg_ctl
would be the sole command you use
for managing the database server instance.
initdb
will attempt to create the directory you
specify if it does not already exist. Of course, this will fail if
initdb
does not have permissions to write in the
parent directory. It's generally recommendable that the
PostgreSQL user own not just the data
directory but its parent directory as well, so that this should not
be a problem. If the desired parent directory doesn't exist either,
you will need to create it first, using root privileges if the
grandparent directory isn't writable. So the process might look
like this:
root# mkdir /usr/local/pgsql
root# chown postgres /usr/local/pgsql
root# su postgres
postgres$ initdb -D /usr/local/pgsql/data
initdb
will refuse to run if the data directory
exists and already contains files; this is to prevent accidentally
overwriting an existing installation.
Because the data directory contains all the data stored in the
database, it is essential that it be secured from unauthorized
access. initdb
therefore revokes access
permissions from everyone but the
PostgreSQL user, and optionally, group.
Group access, when enabled, is read-only. This allows an unprivileged
user in the same group as the cluster owner to take a backup of the
cluster data or perform other operations that only require read access.
Note that enabling or disabling group access on an existing cluster requires
the cluster to be shut down and the appropriate mode to be set on all
directories and files before restarting
PostgreSQL. Otherwise, a mix of modes might
exist in the data directory. For clusters that allow access only by the
owner, the appropriate modes are 0700
for directories
and 0600
for files. For clusters that also allow
reads by the group, the appropriate modes are 0750
for directories and 0640
for files.
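One way to enable group access is at initialization time, using initdb's --allow-group-access option; a sketch only, with an illustrative data directory:
initdb -D /usr/local/pgsql/data --allow-group-access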
However, while the directory contents are secure, the default
client authentication setup allows any local user to connect to the
database and even become the database superuser. If you do not
trust other local users, we recommend you use one of
initdb's -W, --pwprompt
or --pwfile options to assign a password to the
database superuser.
Also, specify -A scram-sha-256
so that the default trust authentication
mode is not used; or modify the generated pg_hba.conf
file after running initdb, but
before you start the server for the first time. (Other
reasonable approaches include using peer authentication
or file system permissions to restrict connections. See Chapter 21 for more information.)
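For example, to follow both recommendations at once (the data directory path is illustrative):
initdb -D /usr/local/pgsql/data -A scram-sha-256 --pwprompt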
initdb
also initializes the default
locale for the database cluster.
Normally, it will just take the locale settings in the environment
and apply them to the initialized database. It is possible to
specify a different locale for the database; more information about
that can be found in Section 24.1. The default sort order used
within the particular database cluster is set by
initdb
, and while you can create new databases using
different sort order, the order used in the template databases that initdb
creates cannot be changed without dropping and recreating them.
There is also a performance impact for using locales
other than C
or POSIX
. Therefore, it is
important to make this choice correctly the first time.
initdb
also sets the default character set encoding
for the database cluster. Normally this should be chosen to match the
locale setting. For details see Section 24.3.
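For example, to choose the locale and encoding explicitly rather than inheriting them from the environment (the locale name shown must exist on your system; the values here are only an illustration):
initdb -D /usr/local/pgsql/data --locale=en_US.UTF-8 --encoding=UTF8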
Non-C
and non-POSIX
locales rely on the
operating system's collation library for character set ordering.
This controls the ordering of keys stored in indexes. For this reason,
a cluster cannot switch to an incompatible collation library version,
either through snapshot restore, binary streaming replication, a
different operating system, or an operating system upgrade.
Many installations create their database clusters on file systems (volumes) other than the machine's “root” volume. If you choose to do this, it is not advisable to try to use the secondary volume's topmost directory (mount point) as the data directory. Best practice is to create a directory within the mount-point directory that is owned by the PostgreSQL user, and then create the data directory within that. This avoids permissions problems, particularly for operations such as pg_upgrade, and it also ensures clean failures if the secondary volume is taken offline.
Generally, any file system with POSIX semantics can be used for PostgreSQL. Users prefer different file systems for a variety of reasons, including vendor support, performance, and familiarity. Experience suggests that, all other things being equal, one should not expect major performance or behavior changes merely from switching file systems or making minor file system configuration changes.
It is possible to use an NFS file system for storing the PostgreSQL data directory. PostgreSQL does nothing special for NFS file systems, meaning it assumes NFS behaves exactly like locally-connected drives. PostgreSQL does not use any functionality that is known to have nonstandard behavior on NFS, such as file locking.
The only firm requirement for using NFS with
PostgreSQL is that the file system is mounted
using the hard
option. With the
hard
option, processes can “hang”
indefinitely if there are network problems, so this configuration will
require a careful monitoring setup. The soft
option
will interrupt system calls in case of network problems, but
PostgreSQL will not repeat system calls
interrupted in this way, so any such interruption will result in an I/O
error being reported.
It is not necessary to use the sync
mount option. The
behavior of the async
option is sufficient, since
PostgreSQL issues fsync
calls at appropriate times to flush the write caches. (This is analogous
to how it works on a local file system.) However, it is strongly
recommended to use the sync
export option on the NFS
server on systems where it exists (mainly Linux).
Otherwise, an fsync
or equivalent on the NFS client is
not actually guaranteed to reach permanent storage on the server, which
could cause corruption similar to running with the parameter fsync off. The defaults of these mount and export
options differ between vendors and versions, so it is recommended to
check and perhaps specify them explicitly in any case to avoid any
ambiguity.
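As an illustration only (the server name and paths are placeholders, and option spellings can vary between NFS implementations), an /etc/fstab entry honoring the hard requirement might look like:
nfsserver:/export/pgdata  /var/lib/pgsql  nfs  rw,hard  0 0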
In some cases, an external storage product can be accessed either via NFS or a lower-level protocol such as iSCSI. In the latter case, the storage appears as a block device and any available file system can be created on it. That approach might relieve the DBA from having to deal with some of the idiosyncrasies of NFS, but of course the complexity of managing remote storage then happens at other levels.
Before anyone can access the database, you must start the database
server. The database server program is called
postgres
.
If you are using a pre-packaged version of PostgreSQL, it almost certainly includes provisions for running the server as a background task according to the conventions of your operating system. Using the package's infrastructure to start the server will be much less work than figuring out how to do this yourself. Consult the package-level documentation for details.
The bare-bones way to start the server manually is just to invoke
postgres
directly, specifying the location of the
data directory with the -D
option, for example:
$ postgres -D /usr/local/pgsql/data
which will leave the server running in the foreground. This must be
done while logged into the PostgreSQL user
account. Without -D
, the server will try to use
the data directory named by the environment variable PGDATA
.
If that variable is not provided either, it will fail.
Normally it is better to start postgres
in the
background. For this, use the usual Unix shell syntax:
$ postgres -D /usr/local/pgsql/data >logfile 2>&1 &
It is important to store the server's stdout and stderr output somewhere, as shown above. It will help for auditing purposes and to diagnose problems. (See Section 25.3 for a more thorough discussion of log file handling.)
The postgres
program also takes a number of other
command-line options. For more information, see the
postgres reference page
and Chapter 20 below.
This shell syntax can get tedious quickly. Therefore the wrapper program pg_ctl is provided to simplify some tasks. For example:
pg_ctl start -l logfile
will start the server in the background and put the output into the
named log file. The -D
option has the same meaning
here as for postgres
. pg_ctl
is also capable of stopping the server.
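For example, the server can later be stopped with a command such as the following (the data directory path is illustrative; fast is one of several shutdown modes pg_ctl supports):
pg_ctl stop -D /usr/local/pgsql/data -m fast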
Normally, you will want to start the database server when the
computer boots.
Autostart scripts are operating-system-specific.
There are a few example scripts distributed with
PostgreSQL in the
contrib/start-scripts
directory. Installing one will require
root privileges.
Different systems have different conventions for starting up daemons
at boot time. Many systems have a file
/etc/rc.local
or
/etc/rc.d/rc.local
. Others use init.d
or
rc.d
directories. Whatever you do, the server must be
run by the PostgreSQL user account
and not by root or any other user. Therefore you
probably should form your commands using
su postgres -c '...'
. For example:
su postgres -c 'pg_ctl start -D /usr/local/pgsql/data -l serverlog'
Here are a few more operating-system-specific suggestions. (In each case be sure to use the proper installation directory and user name where we show generic values.)
For FreeBSD, look at the file
contrib/start-scripts/freebsd
in the
PostgreSQL source distribution.
On OpenBSD, add the following lines
to the file /etc/rc.local
:
if [ -x /usr/local/pgsql/bin/pg_ctl -a -x /usr/local/pgsql/bin/postgres ]; then
    su -l postgres -c '/usr/local/pgsql/bin/pg_ctl start -s -l /var/postgresql/log -D /usr/local/pgsql/data'
    echo -n ' postgresql'
fi
On Linux systems either add
/usr/local/pgsql/bin/pg_ctl start -l logfile -D /usr/local/pgsql/data
to /etc/rc.d/rc.local
or /etc/rc.local
or look at the file
contrib/start-scripts/linux
in the
PostgreSQL source distribution.
When using systemd, you can use the following
service unit file (e.g.,
at /etc/systemd/system/postgresql.service
):
[Unit]
Description=PostgreSQL database server
Documentation=man:postgres(1)
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=postgres
ExecStart=/usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
ExecReload=/bin/kill -HUP $MAINPID
KillMode=mixed
KillSignal=SIGINT
TimeoutSec=infinity

[Install]
WantedBy=multi-user.target
Using Type=notify
requires that the server binary was
built with configure --with-systemd
.
Consider carefully the timeout
setting. systemd has a default timeout of 90
seconds as of this writing and will kill a process that does not report
readiness within that time. But a PostgreSQL
server that might have to perform crash recovery at startup could take
much longer to become ready. The suggested value
of infinity
disables the timeout logic.
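After creating the unit file, the usual systemd commands apply, for example:
systemctl daemon-reload
systemctl enable postgresql.service
systemctl start postgresql.service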
On NetBSD, use either the FreeBSD or Linux start scripts, depending on preference.
On Solaris, create a file called
/etc/init.d/postgresql
that contains
the following line:
su - postgres -c "/usr/local/pgsql/bin/pg_ctl start -l logfile -D /usr/local/pgsql/data"
Then, create a symbolic link to it in /etc/rc3.d
as
S99postgresql
.
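For example (using the paths shown above):
ln -s /etc/init.d/postgresql /etc/rc3.d/S99postgresql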
While the server is running, its
PID is stored in the file
postmaster.pid
in the data directory. This is
used to prevent multiple server instances from
running in the same data directory and can also be used for
shutting down the server.
There are several common reasons the server might fail to start. Check the server's log file, or start it by hand (without redirecting standard output or standard error) and see what error messages appear. Below we explain some of the most common error messages in more detail.
LOG: could not bind IPv4 address "127.0.0.1": Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
FATAL: could not create any TCP/IP sockets
This usually means just what it suggests: you tried to start
another server on the same port where one is already running.
However, if the kernel error message is not Address
already in use
or some variant of that, there might
be a different problem. For example, trying to start a server
on a reserved port number might draw something like:
$ postgres -p 666
LOG: could not bind IPv4 address "127.0.0.1": Permission denied
HINT: Is another postmaster already running on port 666? If not, wait a few seconds and retry.
FATAL: could not create any TCP/IP sockets
A message like:
FATAL: could not create shared memory segment: Invalid argument
DETAIL: Failed system call was shmget(key=5440001, size=4011376640, 03600).
probably means your kernel's limit on the size of shared memory is
smaller than the work area PostgreSQL
is trying to create (4011376640 bytes in this example).
This is only likely to happen if you have set shared_memory_type
to sysv
. In that case, you
can try starting the server with a smaller-than-normal number of
buffers (shared_buffers), or
reconfigure your kernel to increase the allowed shared memory
size. You might also see this message when trying to start multiple
servers on the same machine, if their total space requested
exceeds the kernel limit.
An error like:
FATAL: could not create semaphores: No space left on device
DETAIL: Failed system call was semget(5440126, 17, 03600).
does not mean you've run out of disk space. It means your kernel's limit on the number of System V semaphores is smaller than the number PostgreSQL wants to create. As above, you might be able to work around the problem by starting the server with a reduced number of allowed connections (max_connections), but you'll eventually want to increase the kernel limit.
Details about configuring System V IPC facilities are given in Section 19.4.1.
Although the error conditions possible on the client side are quite varied and application-dependent, a few of them might be directly related to how the server was started. Conditions other than those shown below should be documented with the respective client application.
psql: error: connection to server at "server.joe.com" (123.123.123.123), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
This is the generic “I couldn't find a server to talk to” failure. It looks like the above when TCP/IP communication is attempted. A common mistake is to forget to configure the server to allow TCP/IP connections.
Alternatively, you might get this when attempting Unix-domain socket communication to a local server:
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?
If the server is indeed running, check that the client's idea of the
socket path (here /tmp
) agrees with the server's
unix_socket_directories setting.
A connection failure message always shows the server address or socket
path name, which is useful in verifying that the client is trying to
connect to the right place. If there is in fact no server
listening there, the kernel error message will typically be either
Connection refused
or
No such file or directory
, as
illustrated. (It is important to realize that
Connection refused
in this context
does not mean that the server got your
connection request and rejected it. That case will produce a
different message, as shown in Section 21.15.) Other error messages
such as Connection timed out
might
indicate more fundamental problems, like lack of network
connectivity, or a firewall blocking the connection.
PostgreSQL can sometimes exhaust various operating system resource limits, especially when multiple copies of the server are running on the same system, or in very large installations. This section explains the kernel resources used by PostgreSQL and the steps you can take to resolve problems related to kernel resource consumption.
PostgreSQL requires the operating system to provide inter-process communication (IPC) features, specifically shared memory and semaphores. Unix-derived systems typically provide “System V” IPC, “POSIX” IPC, or both. Windows has its own implementation of these features and is not discussed here.
By default, PostgreSQL allocates
a very small amount of System V shared memory, as well as a much larger
amount of anonymous mmap
shared memory.
Alternatively, a single large System V shared memory region can be used
(see shared_memory_type).
In addition a significant number of semaphores, which can be either
System V or POSIX style, are created at server startup. Currently,
POSIX semaphores are used on Linux and FreeBSD systems while other
platforms use System V semaphores.
System V IPC features are typically constrained by system-wide allocation limits. When PostgreSQL exceeds one of these limits, the server will refuse to start and should leave an instructive error message describing the problem and what to do about it. (See also Section 19.3.1.) The relevant kernel parameters are named consistently across different systems; Table 19.1 gives an overview. The methods to set them, however, vary. Suggestions for some platforms are given below.
Table 19.1. System V IPC Parameters
Name | Description | Values needed to run one PostgreSQL instance
---|---|---
SHMMAX | Maximum size of shared memory segment (bytes) | at least 1kB, but the default is usually much higher
SHMMIN | Minimum size of shared memory segment (bytes) | 1
SHMALL | Total amount of shared memory available (bytes or pages) | same as SHMMAX if bytes, or ceil(SHMMAX/PAGE_SIZE) if pages, plus room for other applications
SHMSEG | Maximum number of shared memory segments per process | only 1 segment is needed, but the default is much higher
SHMMNI | Maximum number of shared memory segments system-wide | like SHMSEG plus room for other applications
SEMMNI | Maximum number of semaphore identifiers (i.e., sets) | at least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 6) / 16) plus room for other applications
SEMMNS | Maximum number of semaphores system-wide | ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 6) / 16) * 17 plus room for other applications
SEMMSL | Maximum number of semaphores per set | at least 17
SEMMAP | Number of entries in semaphore map | see text
SEMVMX | Maximum value of semaphore | at least 1000 (The default is often 32767; do not change unless necessary)
PostgreSQL requires a few bytes of System V shared memory
(typically 48 bytes, on 64-bit platforms) for each copy of the server.
On most modern operating systems, this amount can easily be allocated.
However, if you are running many copies of the server or you explicitly
configure the server to use large amounts of System V shared memory (see
shared_memory_type and dynamic_shared_memory_type), it may be necessary to
increase SHMALL
, which is the total amount of System V shared
memory system-wide. Note that SHMALL
is measured in pages
rather than bytes on many systems.
Less likely to cause problems is the minimum size for shared
memory segments (SHMMIN
), which should be at most
approximately 32 bytes for PostgreSQL (it is
usually just 1). The maximum number of segments system-wide
(SHMMNI
) or per-process (SHMSEG
) are unlikely
to cause a problem unless your system has them set to zero.
When using System V semaphores,
PostgreSQL uses one semaphore per allowed connection
(max_connections), allowed autovacuum worker process
(autovacuum_max_workers), allowed WAL sender process
(max_wal_senders), and allowed background
process (max_worker_processes), in sets of 16.
Each such set will
also contain a 17th semaphore which contains a “magic
number”, to detect collision with semaphore sets used by
other applications. The maximum number of semaphores in the system
is set by SEMMNS
, which consequently must be at least
as high as max_connections
plus
autovacuum_max_workers
plus max_wal_senders
,
plus max_worker_processes
, plus one extra for each 16
allowed connections plus workers (see the formula in Table 19.1). The parameter SEMMNI
determines the limit on the number of semaphore sets that can
exist on the system at one time. Hence this parameter must be at
least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 6) / 16)
.
Lowering the number
of allowed connections is a temporary workaround for failures,
which are usually confusingly worded “No space
left on device”, from the function semget
.
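As a worked example with illustrative settings (max_connections = 100, autovacuum_max_workers = 3, max_wal_senders = 10, max_worker_processes = 8): ceil((100 + 3 + 10 + 8 + 6) / 16) = ceil(7.94) = 8, so SEMMNI must be at least 8 and SEMMNS at least 8 * 17 = 136, in each case plus room for other applications.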
In some cases it might also be necessary to increase
SEMMAP
to be at least on the order of
SEMMNS
. If the system has this parameter
(many do not), it defines the size of the semaphore
resource map, in which each contiguous block of available semaphores
needs an entry. When a semaphore set is freed it is either added to
an existing entry that is adjacent to the freed block or it is
registered under a new map entry. If the map is full, the freed
semaphores get lost (until reboot). Fragmentation of the semaphore
space could over time lead to fewer available semaphores than there
should be.
Various other settings related to “semaphore undo”, such as
SEMMNU
and SEMUME
, do not affect
PostgreSQL.
When using POSIX semaphores, the number of semaphores needed is the same as for System V, that is one semaphore per allowed connection (max_connections), allowed autovacuum worker process (autovacuum_max_workers), allowed WAL sender process (max_wal_senders), and allowed background process (max_worker_processes). On the platforms where this option is preferred, there is no specific kernel limit on the number of POSIX semaphores.
It should not be necessary to do
any special configuration for such parameters as
SHMMAX
, as it appears this is configured to
allow all memory to be used as shared memory. That is the
sort of configuration commonly used for other databases such
as DB/2.
It might, however, be necessary to modify the global
ulimit
information in
/etc/security/limits
, as the default hard
limits for file sizes (fsize
) and numbers of
files (nofiles
) might be too low.
The default shared memory settings are usually good enough, unless
you have set shared_memory_type
to sysv
.
System V semaphores are not used on this platform.
The default IPC settings can be changed using
the sysctl
or
loader
interfaces. The following
parameters can be set using sysctl
:
# sysctl kern.ipc.shmall=32768
# sysctl kern.ipc.shmmax=134217728
To make these settings persist over reboots, modify
/etc/sysctl.conf
.
If you have set shared_memory_type
to
sysv
, you might also want to configure your kernel
to lock System V shared memory into RAM and prevent it from being paged
out to swap. This can be accomplished using the sysctl
setting kern.ipc.shm_use_phys
.
If running in a FreeBSD jail, you should set its
sysvshm
parameter to new
, so that
it has its own separate System V shared memory namespace.
(Before FreeBSD 11.0, it was necessary to enable shared access to
the host's IPC namespace from jails, and take measures to avoid
collisions.)
The default shared memory settings are usually good enough, unless
you have set shared_memory_type
to sysv
.
You will usually want to increase kern.ipc.semmni
and kern.ipc.semmns
,
as NetBSD's default settings
for these are uncomfortably small.
IPC parameters can be adjusted using sysctl
,
for example:
# sysctl -w kern.ipc.semmni=100
To make these settings persist over reboots, modify
/etc/sysctl.conf
.
If you have set shared_memory_type
to
sysv
, you might also want to configure your kernel
to lock System V shared memory into RAM and prevent it from being paged
out to swap. This can be accomplished using the sysctl
setting kern.ipc.shm_use_phys
.
The default shared memory settings are usually good enough, unless
you have set shared_memory_type
to sysv
.
You will usually want to
increase kern.seminfo.semmni
and kern.seminfo.semmns
,
as OpenBSD's default settings
for these are uncomfortably small.
IPC parameters can be adjusted using sysctl
,
for example:
# sysctl kern.seminfo.semmni=100
To make these settings persist over reboots, modify
/etc/sysctl.conf
.
The default settings tend to suffice for normal installations.
IPC parameters can be set in the System Administration Manager (SAM) under Kernel Configuration → Configurable Parameters. Choose Create A New Kernel when you're done.
The default shared memory settings are usually good enough, unless
you have set shared_memory_type
to sysv
,
and even then only on older kernel versions that shipped with low defaults.
System V semaphores are not used on this platform.
The shared memory size settings can be changed via the
sysctl
interface. For example, to allow 16 GB:
$ sysctl -w kernel.shmmax=17179869184
$ sysctl -w kernel.shmall=4194304
To make these settings persist over reboots, see
/etc/sysctl.conf
.
The default shared memory and semaphore settings are usually good enough, unless
you have set shared_memory_type
to sysv
.
The recommended method for configuring shared memory in macOS
is to create a file named /etc/sysctl.conf
,
containing variable assignments such as:
kern.sysv.shmmax=4194304
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.sysv.shmall=1024
Note that in some macOS versions,
all five shared-memory parameters must be set in
/etc/sysctl.conf
, else the values will be ignored.
SHMMAX
can only be set to a multiple of 4096.
SHMALL
is measured in 4 kB pages on this platform.
It is possible to change all but SHMMNI
on the fly, using
sysctl. But it's still best to set up your preferred
values via /etc/sysctl.conf
, so that the values will be
kept across reboots.
The default shared memory and semaphore settings are usually good enough for most
PostgreSQL applications. Solaris defaults
to a SHMMAX
of one-quarter of system RAM.
To further adjust this setting, use a project setting associated
with the postgres
user. For example, run the
following as root
:
projadd -c "PostgreSQL DB User" -K "project.max-shm-memory=(privileged,8GB,deny)" -U postgres -G postgres user.postgres
This command adds the user.postgres
project and
sets the shared memory maximum for the postgres
user to 8GB, and takes effect the next time that user logs
in, or when you restart PostgreSQL (not reload).
The above assumes that PostgreSQL is run by
the postgres
user in the postgres
group. No server reboot is required.
Other recommended kernel setting changes for database servers which will have a large number of connections are:
project.max-shm-ids=(priv,32768,deny)
project.max-sem-ids=(priv,4096,deny)
project.max-msg-ids=(priv,4096,deny)
Additionally, if you are running PostgreSQL
inside a zone, you may need to raise the zone resource usage
limits as well. See "Chapter 2: Projects and Tasks" in the
System Administrator's Guide for more
information on projects
and prctl
.
If systemd is in use, some care must be taken
that IPC resources (including shared memory) are not prematurely
removed by the operating system. This is especially of concern when
installing PostgreSQL from source. Users of distribution packages of
PostgreSQL are less likely to be affected, as
the postgres
user is then normally created as a system
user.
The setting RemoveIPC
in logind.conf
controls whether IPC objects are
removed when a user fully logs out. System users are exempt. This
setting defaults to on in stock systemd, but
some operating system distributions default it to off.
A typical observed effect when this setting is on is that shared memory objects used for parallel query execution are removed at apparently random times, leading to errors and warnings while attempting to open and remove them, like
WARNING: could not remove shared memory segment "/PostgreSQL.1450751626": No such file or directory
Different types of IPC objects (shared memory vs. semaphores, System V vs. POSIX) are treated slightly differently by systemd, so one might observe that some IPC resources are not removed in the same way as others. But it is not advisable to rely on these subtle differences.
A “user logging out” might happen as part of a maintenance
job or manually when an administrator logs in as
the postgres
user or something similar, so it is hard
to prevent in general.
What is a “system user” is determined
at systemd compile time from
the SYS_UID_MAX
setting
in /etc/login.defs
.
Packaging and deployment scripts should be careful to create
the postgres
user as a system user by
using useradd -r
, adduser --system
,
or equivalent.
Alternatively, if the user account was created incorrectly or cannot be changed, it is recommended to set
RemoveIPC=no
in /etc/systemd/logind.conf
or another appropriate
configuration file.
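For example, a minimal logind.conf sketch for the latter approach (the exact file location can vary by distribution) is:
[Login]
RemoveIPC=no
systemd-logind typically needs to be restarted for such a change to take effect.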
At least one of these two things has to be ensured, or the PostgreSQL server will be very unreliable.
Unix-like operating systems enforce various kinds of resource limits
that might interfere with the operation of your
PostgreSQL server. Of particular
importance are limits on the number of processes per user, the
number of open files per process, and the amount of memory available
to each process. Each of these have a “hard” and a
“soft” limit. The soft limit is what actually counts
but it can be changed by the user up to the hard limit. The hard
limit can only be changed by the root user. The system call
setrlimit
is responsible for setting these
parameters. The shell's built-in command ulimit
(Bourne shells) or limit
(csh) is
used to control the resource limits from the command line. On
BSD-derived systems the file /etc/login.conf
controls the various resource limits set during login. See the
operating system documentation for details. The relevant
parameters are maxproc
,
openfiles
, and datasize
. For
example:
default:\
        ...
        :datasize-cur=256M:\
        :maxproc-cur=256:\
        :openfiles-cur=256:\
        ...
(-cur
is the soft limit. Append
-max
to set the hard limit.)
Kernels can also have system-wide limits on some resources.
On Linux the kernel parameter
fs.file-max
determines the maximum number of open
files that the kernel will support. It can be changed with
sysctl -w fs.file-max=N.
To make the setting persist across reboots, add an assignment
in /etc/sysctl.conf.
The maximum limit of files per process is fixed at the time the
kernel is compiled; see
/usr/src/linux/Documentation/proc.txt
for
more information.
The PostgreSQL server uses one process per connection so you should provide for at least as many processes as allowed connections, in addition to what you need for the rest of your system. This is usually not a problem but if you run several servers on one machine things might get tight.
The factory default limit on open files is often set to “socially friendly” values that allow many users to coexist on a machine without using an inappropriate fraction of the system resources. If you run many servers on a machine this is perhaps what you want, but on dedicated servers you might want to raise this limit.
On the other side of the coin, some systems allow individual processes to open large numbers of files; if more than a few processes do so then the system-wide limit can easily be exceeded. If you find this happening, and you do not want to alter the system-wide limit, you can set PostgreSQL's max_files_per_process configuration parameter to limit the consumption of open files.
Another kernel limit that may be of concern when supporting large
numbers of client connections is the maximum socket connection queue
length. If more than that many connection requests arrive within a very
short period, some may get rejected before the postmaster can service
the requests, with those clients receiving unhelpful connection failure
errors such as “Resource temporarily unavailable” or
“Connection refused”. The default queue length limit is 128
on many platforms. To raise it, adjust the appropriate kernel parameter
via sysctl, then restart the postmaster.
The parameter is variously named net.core.somaxconn
on Linux, kern.ipc.soacceptqueue
on newer FreeBSD,
and kern.ipc.somaxconn
on macOS and other BSD
variants.
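For example, on Linux one might raise the limit to 1024 (an illustrative value):
# sysctl -w net.core.somaxconn=1024
As with the other sysctl settings above, add the assignment to /etc/sysctl.conf to make it persist across reboots.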
The default virtual memory behavior on Linux is not optimal for PostgreSQL. Because of the way that the kernel implements memory overcommit, the kernel might terminate the PostgreSQL postmaster (the supervisor server process) if the memory demands of either PostgreSQL or another process cause the system to run out of virtual memory.
If this happens, you will see a kernel message that looks like this (consult your system documentation and configuration on where to look for such a message):
Out of Memory: Killed process 12345 (postgres).
This indicates that the postgres
process
has been terminated due to memory pressure.
Although existing database connections will continue to function
normally, no new connections will be accepted. To recover,
PostgreSQL will need to be restarted.
One way to avoid this problem is to run PostgreSQL on a machine where you can be sure that other processes will not run the machine out of memory. If memory is tight, increasing the swap space of the operating system can help avoid the problem, because the out-of-memory (OOM) killer is invoked only when physical memory and swap space are exhausted.
If PostgreSQL itself is the cause of the
system running out of memory, you can avoid the problem by changing
your configuration. In some cases, it may help to lower memory-related
configuration parameters, particularly
shared_buffers
,
work_mem
, and
hash_mem_multiplier
.
In other cases, the problem may be caused by allowing too many
connections to the database server itself. In many cases, it may
be better to reduce
max_connections
and instead make use of external connection-pooling software.
It is possible to modify the
kernel's behavior so that it will not “overcommit” memory.
Although this setting will not prevent the OOM killer from being invoked
altogether, it will lower the chances significantly and will therefore
lead to more robust system behavior. This is done by selecting strict
overcommit mode via sysctl
:
sysctl -w vm.overcommit_memory=2
or placing an equivalent entry in /etc/sysctl.conf
.
You might also wish to modify the related setting
vm.overcommit_ratio
. For details see the kernel documentation
file https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
Another approach, which can be used with or without altering
vm.overcommit_memory
, is to set the process-specific
OOM score adjustment value for the postmaster process to
-1000
, thereby guaranteeing it will not be targeted by the OOM
killer. The simplest way to do this is to execute
echo -1000 > /proc/self/oom_score_adj
in the postmaster's startup script just before invoking the postmaster. Note that this action must be done as root, or it will have no effect; so a root-owned startup script is the easiest place to do it. If you do this, you should also set these environment variables in the startup script before invoking the postmaster:
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=0
These settings will cause postmaster child processes to run with the
normal OOM score adjustment of zero, so that the OOM killer can still
target them at need. You could use some other value for
PG_OOM_ADJUST_VALUE
if you want the child processes to run
with some other OOM score adjustment. (PG_OOM_ADJUST_VALUE
can also be omitted, in which case it defaults to zero.) If you do not
set PG_OOM_ADJUST_FILE
, the child processes will run with the
same OOM score adjustment as the postmaster, which is unwise since the
whole point is to ensure that the postmaster has a preferential setting.
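A minimal sketch of such a root-owned startup script, assuming the server is installed under /usr/local/pgsql and runs as the postgres account (both illustrative), might be:
# run as root: lower this shell's OOM score; the postmaster inherits it
echo -1000 > /proc/self/oom_score_adj
# tell the postmaster to reset its child processes to the normal adjustment
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=0
su postgres -c "/usr/local/pgsql/bin/pg_ctl start -D /usr/local/pgsql/data"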
Using huge pages reduces overhead when using large contiguous chunks of
memory, as PostgreSQL does, particularly when
using large values of shared_buffers. To use this
feature in PostgreSQL you need a kernel
with CONFIG_HUGETLBFS=y
and
CONFIG_HUGETLB_PAGE=y
. You will also have to configure
the operating system to provide enough huge pages of the desired size.
To estimate the number of huge pages needed, start
PostgreSQL without huge pages enabled and check
the postmaster's anonymous shared memory segment size, as well as the
system's default and supported huge page sizes, using the
/proc
and /sys
file systems.
This might look like:
$ head -1 $PGDATA/postmaster.pid
4170
$ pmap 4170 | awk '/rw-s/ && /zero/ {print $2}'
6490428K
$ grep ^Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB
$ ls /sys/kernel/mm/hugepages
hugepages-1048576kB hugepages-2048kB
In this example the default is 2MB, but you can also explicitly request
either 2MB or 1GB with huge_page_size.
Assuming 2MB
huge pages,
6490428
/ 2048
gives approximately
3169.154
, so in this example we need at
least 3170
huge pages. A larger setting would be
appropriate if other programs on the machine also need huge pages.
We can set this with:
# sysctl -w vm.nr_hugepages=3170
Don't forget to add this setting to /etc/sysctl.conf
so that it is reapplied after reboots. For non-default huge page sizes,
we can instead use:
# echo 3170 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
It is also possible to provide these settings at boot time using
kernel parameters such as hugepagesz=2M hugepages=3170
.
Sometimes the kernel is not able to allocate the desired number of huge pages immediately due to fragmentation, so it might be necessary to repeat the command or to reboot. (Immediately after a reboot, most of the machine's memory should be available to convert into huge pages.) To verify the huge page allocation situation for a given size, use:
$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
It may also be necessary to give the database server's operating system
user permission to use huge pages by setting
vm.hugetlb_shm_group
via sysctl, and/or
give permission to lock memory with ulimit -l
.
The default behavior for huge pages in
PostgreSQL is to use them when possible, with
the system's default huge page size, and
to fall back to normal pages on failure. To enforce the use of huge
pages, you can set huge_pages
to on
in postgresql.conf
.
Note that with this setting PostgreSQL will fail to
start if not enough huge pages are available.
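For instance, a minimal postgresql.conf sketch matching the 2MB example above (the shared_buffers value is illustrative) could be:
shared_buffers = 6GB
huge_pages = on
huge_page_size = 2MB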
For a detailed description of the Linux huge pages feature have a look at https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt.
There are several ways to shut down the database server.
Under the hood, they all reduce to sending a signal to the supervisor
postgres
process.
If you are using a pre-packaged version of PostgreSQL, and you used its provisions for starting the server, then you should also use its provisions for stopping the server. Consult the package-level documentation for details.
When managing the server directly, you can control the type of shutdown
by sending different signals to the postgres
process:
This is the Smart Shutdown mode. After receiving SIGTERM, the server disallows new connections, but lets existing sessions end their work normally. It shuts down only after all of the sessions terminate. If the server is in online backup mode, it additionally waits until online backup mode is no longer active. While backup mode is active, new connections will still be allowed, but only to superusers (this exception allows a superuser to connect to terminate online backup mode). If the server is in recovery when a smart shutdown is requested, recovery and streaming replication will be stopped only after all regular sessions have terminated.
This is the Fast Shutdown mode. The server disallows new connections and sends all existing server processes SIGTERM, which will cause them to abort their current transactions and exit promptly. It then waits for all server processes to exit and finally shuts down. If the server is in online backup mode, backup mode will be terminated, rendering the backup useless.
This is the Immediate Shutdown mode. The server will send SIGQUIT to all child processes and wait for them to terminate. If any do not terminate within 5 seconds, they will be sent SIGKILL. The supervisor server process exits as soon as all child processes have exited, without doing normal database shutdown processing. This will lead to recovery (by replaying the WAL log) upon next start-up. This is recommended only in emergencies.
The pg_ctl program provides a convenient
interface for sending these signals to shut down the server.
Alternatively, you can send the signal directly using kill
on non-Windows systems.
The PID of the postgres
process can be
found using the ps
program, or from the file
postmaster.pid
in the data directory. For
example, to do a fast shutdown:
$ kill -INT `head -1 /usr/local/pgsql/data/postmaster.pid`
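If you prefer pg_ctl, the equivalent fast shutdown (assuming the same data directory) is:
$ pg_ctl stop -D /usr/local/pgsql/data -m fast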
It is best not to use SIGKILL to shut down the
server. Doing so will prevent the server from releasing shared memory and
semaphores. Furthermore, SIGKILL kills
the postgres
process without letting it relay the
signal to its subprocesses, so it might be necessary to kill the
individual subprocesses by hand as well.
To terminate an individual session while allowing other sessions to
continue, use pg_terminate_backend()
(see Table 9.86) or send a
SIGTERM signal to the child process associated with
the session.
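For example, to terminate all sessions belonging to a hypothetical role named appuser:
SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
 WHERE usename = 'appuser';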
This section discusses how to upgrade your database data from one PostgreSQL release to a newer one.
Current PostgreSQL version numbers consist of a major and a minor version number. For example, in the version number 10.1, the 10 is the major version number and the 1 is the minor version number, meaning this would be the first minor release of the major release 10. For releases before PostgreSQL version 10.0, version numbers consist of three numbers, for example, 9.5.3. In those cases, the major version consists of the first two digit groups of the version number, e.g., 9.5, and the minor version is the third number, e.g., 3, meaning this would be the third minor release of the major release 9.5.
Minor releases never change the internal storage format and are always compatible with earlier and later minor releases of the same major version number. For example, version 10.1 is compatible with version 10.0 and version 10.6. Similarly, for example, 9.5.3 is compatible with 9.5.0, 9.5.1, and 9.5.6. To update between compatible versions, you simply replace the executables while the server is down and restart the server. The data directory remains unchanged — minor upgrades are that simple.
For major releases of PostgreSQL, the internal data storage format is subject to change, thus complicating upgrades. The traditional method for moving data to a new major version is to dump and restore the database, though this can be slow. A faster method is pg_upgrade. Replication methods are also available, as discussed below. (If you are using a pre-packaged version of PostgreSQL, it may provide scripts to assist with major version upgrades. Consult the package-level documentation for details.)
New major versions also typically introduce some user-visible incompatibilities, so application programming changes might be required. All user-visible changes are listed in the release notes (Appendix E); pay particular attention to the section labeled "Migration". Though you can upgrade from one major version to another without upgrading to intervening versions, you should read the major release notes of all intervening versions.
Cautious users will want to test their client applications on the new version before switching over fully; therefore, it's often a good idea to set up concurrent installations of old and new versions. When testing a PostgreSQL major upgrade, consider the following categories of possible changes:
The capabilities available for administrators to monitor and control the server often change and improve in each major release.
Typically this includes new SQL command capabilities and not changes in behavior, unless specifically mentioned in the release notes.
Typically libraries like libpq only add new functionality, again unless mentioned in the release notes.
System catalog changes usually only affect database management tools.
This involves changes in the backend function API, which is written in the C programming language. Such changes affect code that references backend functions deep inside the server.
One upgrade method is to dump data from one major version of PostgreSQL and restore it in another — to do this, you must use a logical backup tool like pg_dumpall; file system level backup methods will not work. (There are checks in place that prevent you from using a data directory with an incompatible version of PostgreSQL, so no great harm can be done by trying to start the wrong server version on a data directory.)
It is recommended that you use the pg_dump and pg_dumpall programs from the newer version of PostgreSQL, to take advantage of enhancements that might have been made in these programs. Current releases of the dump programs can read data from any server version back to 8.0.
These instructions assume that your existing installation is under the
/usr/local/pgsql
directory, and that the data area is in
/usr/local/pgsql/data
. Substitute your paths
appropriately.
If making a backup, make sure that your database is not being updated.
This does not affect the integrity of the backup, but the changed
data would of course not be included. If necessary, edit the
permissions in the file /usr/local/pgsql/data/pg_hba.conf
(or equivalent) to disallow access from everyone except you.
See Chapter 21 for additional information on
access control.
To back up your database installation, type:
pg_dumpall > outputfile
To make the backup, you can use the pg_dumpall command from the version you are currently running; see Section 26.1.2 for more details. For best results, however, try to use the pg_dumpall command from PostgreSQL 14.13, since this version contains bug fixes and improvements over older versions. While this advice might seem idiosyncratic since you haven't installed the new version yet, it is advisable to follow it if you plan to install the new version in parallel with the old version. In that case you can complete the installation normally and transfer the data later. This will also decrease the downtime.
Shut down the old server:
pg_ctl stop
On systems that have PostgreSQL started at boot time, there is probably a start-up file that will accomplish the same thing. For example, on a Red Hat Linux system one might find that this works:
/etc/rc.d/init.d/postgresql stop
See Chapter 19 for details about starting and stopping the server.
If restoring from backup, rename or delete the old installation directory if it is not version-specific. It is a good idea to rename the directory, rather than delete it, in case you have trouble and need to revert to it. Keep in mind the directory might consume significant disk space. To rename the directory, use a command like this:
mv /usr/local/pgsql /usr/local/pgsql.old
(Be sure to move the directory as a single unit so relative paths remain unchanged.)
Install the new version of PostgreSQL as outlined in Section 17.4.
Create a new database cluster if needed. Remember that you must execute these commands while logged in to the special database user account (which you already have if you are upgrading).
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
Restore your previous pg_hba.conf
and any
postgresql.conf
modifications.
Start the database server, again using the special database user account:
/usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
Finally, restore your data from backup with:
/usr/local/pgsql/bin/psql -d postgres -f outputfile
using the new psql.
The least downtime can be achieved by installing the new server in a different directory and running both the old and the new servers in parallel, on different ports. Then you can use something like:
pg_dumpall -p 5432 | psql -d postgres -p 5433
to transfer your data.
The pg_upgrade module allows an installation to
be migrated in-place from one major PostgreSQL
version to another. Upgrades can be performed in minutes,
particularly with --link
mode. It requires steps similar to
pg_dumpall above, e.g., starting/stopping the server,
running initdb. The pg_upgrade documentation outlines the necessary steps.
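As a rough sketch only (the installation paths are illustrative; consult the pg_upgrade documentation for the authoritative procedure), a link-mode run might look like:
pg_upgrade \
  --old-bindir /usr/local/pgsql.old/bin --new-bindir /usr/local/pgsql/bin \
  --old-datadir /usr/local/pgsql.old/data --new-datadir /usr/local/pgsql/data \
  --link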
It is also possible to use logical replication methods to create a standby server with the updated version of PostgreSQL. This is possible because logical replication supports replication between different major versions of PostgreSQL. The standby can be on the same computer or a different computer. Once it has synced up with the primary server (running the older version of PostgreSQL), you can switch primaries and make the standby the primary and shut down the older database instance. Such a switch-over results in only several seconds of downtime for an upgrade.
This method of upgrading can be performed using the built-in logical replication facilities as well as using external logical replication systems such as pglogical, Slony, Londiste, and Bucardo.
While the server is running, it is not possible for a malicious user
to take the place of the normal database server. However, when the
server is down, it is possible for a local user to spoof the normal
server by starting their own server. The spoof server could read
passwords and queries sent by clients, but could not return any data
because the PGDATA
directory would still be secure because
of directory permissions. Spoofing is possible because any user can
start a database server; a client cannot identify an invalid server
unless it is specially configured.
One way to prevent spoofing of local
connections is to use a Unix domain socket directory (unix_socket_directories) that has write permission only
for a trusted local user. This prevents a malicious user from creating
their own socket file in that directory. If you are concerned that
some applications might still reference /tmp
for the
socket file and hence be vulnerable to spoofing, during operating system
startup create a symbolic link /tmp/.s.PGSQL.5432
that points
to the relocated socket file. You also might need to modify your
/tmp
cleanup script to prevent removal of the symbolic link.
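For example, if the socket has been relocated to /var/run/postgresql (an illustrative directory), the link could be created at system startup with:
ln -s /var/run/postgresql/.s.PGSQL.5432 /tmp/.s.PGSQL.5432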
Another option for local
connections is for clients to use
requirepeer
to specify the required owner of the server process connected to
the socket.
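For example, a client connection using this option might look like the following (the socket directory, database, and expected owner are illustrative):
psql "host=/var/run/postgresql dbname=mydb requirepeer=postgres"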
To prevent spoofing on TCP connections, either use SSL certificates and make sure that clients check the server's certificate, or use GSSAPI encryption (or both, if they're on separate connections).
To prevent spoofing with SSL, the server
must be configured to accept only hostssl
connections (Section 21.1) and have SSL key and certificate files
(Section 19.9). The TCP client must connect using
sslmode=verify-ca
or
verify-full
and have the appropriate root certificate
file installed (Section 34.19.1).
To prevent spoofing with GSSAPI, the server must be configured to accept
only hostgssenc
connections
(Section 21.1) and use gss
authentication with them. The TCP client must connect
using gssencmode=require
.
PostgreSQL offers encryption at several levels, and provides flexibility in protecting data from disclosure due to database server theft, unscrupulous administrators, and insecure networks. Encryption might also be required to secure sensitive data such as medical records or financial transactions.
Database user passwords are stored as hashes (determined by the setting password_encryption), so the administrator cannot determine the actual password assigned to the user. If SCRAM or MD5 encryption is used for client authentication, the unencrypted password is never even temporarily present on the server because the client encrypts it before being sent across the network. SCRAM is preferred, because it is an Internet standard and is more secure than the PostgreSQL-specific MD5 authentication protocol.
The pgcrypto module allows certain fields to be stored encrypted. This is useful if only some of the data is sensitive. The client supplies the decryption key and the data is decrypted on the server and then sent to the client.
The decrypted data and the decryption key are present on the server for a brief time while it is being decrypted and communicated between the client and server. This presents a brief moment where the data and keys can be intercepted by someone with complete access to the database server, such as the system administrator.
Storage encryption can be performed at the file system level or the block level. Linux file system encryption options include eCryptfs and EncFS, while FreeBSD uses PEFS. Block level or full disk encryption options include dm-crypt + LUKS on Linux and GEOM modules geli and gbde on FreeBSD. Many other operating systems support this functionality, including Windows.
This mechanism prevents unencrypted data from being read from the drives if the drives or the entire computer is stolen. This does not protect against attacks while the file system is mounted, because when mounted, the operating system provides an unencrypted view of the data. However, to mount the file system, you need some way for the encryption key to be passed to the operating system, and sometimes the key is stored somewhere on the host that mounts the disk.
SSL connections encrypt all data sent across the network: the
password, the queries, and the data returned. The
pg_hba.conf
file allows administrators to specify
which hosts can use non-encrypted connections (host
)
and which require SSL-encrypted connections
(hostssl
). Also, clients can specify that they
connect to servers only via SSL.
GSSAPI-encrypted connections encrypt all data sent across the network,
including queries and data returned. (No password is sent across the
network.) The pg_hba.conf
file allows
administrators to specify which hosts can use non-encrypted connections
(host
) and which require GSSAPI-encrypted connections
(hostgssenc
). Also, clients can specify that they
connect to servers only on GSSAPI-encrypted connections
(gssencmode=require
).
Stunnel or SSH can also be used to encrypt transmissions.
It is possible for both the client and server to provide SSL certificates to each other. It takes some extra configuration on each side, but this provides stronger verification of identity than the mere use of passwords. It prevents a computer from pretending to be the server just long enough to read the password sent by the client. It also helps prevent “man in the middle” attacks where a computer between the client and server pretends to be the server and reads and passes all data between the client and server.
If the system administrator for the server's machine cannot be trusted, it is necessary for the client to encrypt the data; this way, unencrypted data never appears on the database server. Data is encrypted on the client before being sent to the server, and database results have to be decrypted on the client before being used.
PostgreSQL has native support for using SSL connections to encrypt client/server communications for increased security. This requires that OpenSSL is installed on both client and server systems and that support in PostgreSQL is enabled at build time (see Chapter 17).
With SSL support compiled in, the
PostgreSQL server can be started with
SSL enabled by setting the parameter
ssl to on
in
postgresql.conf
. The server will listen for both normal
and SSL connections on the same TCP port, and will negotiate
with any connecting client on whether to use SSL. By
default, this is at the client's option; see Section 21.1 about how to set up the server to require
use of SSL for some or all connections.
To start in SSL mode, files containing the server certificate
and private key must exist. By default, these files are expected to be
named server.crt
and server.key
, respectively, in
the server's data directory, but other names and locations can be specified
using the configuration parameters ssl_cert_file
and ssl_key_file.
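For example, a minimal postgresql.conf sketch enabling SSL with the default file names (spelled out here only for illustration) is:
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'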
On Unix systems, the permissions on server.key
must
disallow any access to world or group; achieve this by the command
chmod 0600 server.key
. Alternatively, the file can be
owned by root and have group read access (that is, 0640
permissions). That setup is intended for installations where certificate
and key files are managed by the operating system. The user under which
the PostgreSQL server runs should then be made a
member of the group that has access to those certificate and key files.
If the data directory allows group read access then certificate files may need to be located outside of the data directory in order to conform to the security requirements outlined above. Generally, group access is enabled to allow an unprivileged user to backup the database, and in that case the backup software will not be able to read the certificate files and will likely error.
If the private key is protected with a passphrase, the server will prompt for the passphrase and will not start until it has been entered. Using a passphrase by default disables the ability to change the server's SSL configuration without a server restart, but see ssl_passphrase_command_supports_reload. Furthermore, passphrase-protected private keys cannot be used at all on Windows.
The first certificate in server.crt
must be the
server's certificate because it must match the server's private key.
The certificates of “intermediate” certificate authorities
can also be appended to the file. Doing this avoids the necessity of
storing intermediate certificates on clients, assuming the root and
intermediate certificates were created with v3_ca
extensions. (This sets the certificate's basic constraint of
CA
to true
.)
This allows easier expiration of intermediate certificates.
It is not necessary to add the root certificate to
server.crt
. Instead, clients must have the root
certificate of the server's certificate chain.
PostgreSQL reads the system-wide
OpenSSL configuration file. By default, this
file is named openssl.cnf
and is located in the
directory reported by openssl version -d
.
This default can be overridden by setting environment variable
OPENSSL_CONF
to the name of the desired configuration file.
OpenSSL supports a wide range of ciphers
and authentication algorithms, of varying strength. While a list of
ciphers can be specified in the OpenSSL
configuration file, you can specify ciphers specifically for use by
the database server by modifying ssl_ciphers in
postgresql.conf
.
It is possible to have authentication without encryption overhead by
using NULL-SHA
or NULL-MD5
ciphers. However,
a man-in-the-middle could read and pass communications between client
and server. Also, encryption overhead is minimal compared to the
overhead of authentication. For these reasons NULL ciphers are not
recommended.
To require the client to supply a trusted certificate,
place certificates of the root certificate authorities
(CAs) you trust in a file in the data
directory, set the parameter ssl_ca_file in
postgresql.conf
to the new file name, and add the
authentication option clientcert=verify-ca
or
clientcert=verify-full
to the appropriate
hostssl
line(s) in pg_hba.conf
.
A certificate will then be requested from the client during SSL
connection startup. (See Section 34.19 for a description
of how to set up certificates on the client.)
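For illustration, a hypothetical pg_hba.conf entry combining SCRAM password authentication with full client-certificate verification might look like:
hostssl all all 0.0.0.0/0 scram-sha-256 clientcert=verify-full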
For a hostssl
entry with
clientcert=verify-ca
, the server will verify
that the client's certificate is signed by one of the trusted
certificate authorities. If clientcert=verify-full
is specified, the server will not only verify the certificate
chain, but it will also check whether the username or its mapping
matches the cn
(Common Name) of the provided certificate.
Note that certificate chain validation is always ensured when the
cert
authentication method is used
(see Section 21.12).
Intermediate certificates that chain up to existing root certificates
can also appear in the ssl_ca_file file if
you wish to avoid storing them on clients (assuming the root and
intermediate certificates were created with v3_ca
extensions). Certificate Revocation List (CRL) entries are also
checked if the parameter ssl_crl_file or
ssl_crl_dir is set.
The clientcert
authentication option is available for
all authentication methods, but only in pg_hba.conf
lines
specified as hostssl
. When clientcert
is
not specified, the server verifies the client certificate against its CA
file only if a client certificate is presented and the CA is configured.
There are two approaches to enforce that users provide a certificate during login.
The first approach makes use of the cert
authentication
method for hostssl
entries in pg_hba.conf
,
such that the certificate itself is used for authentication while also
providing ssl connection security. See Section 21.12 for details.
(It is not necessary to specify any clientcert
options
explicitly when using the cert
authentication method.)
In this case, the cn
(Common Name) provided in
the certificate is checked against the user name or an applicable mapping.
The second approach combines any authentication method for hostssl
entries with the verification of client certificates by setting the
clientcert
authentication option to verify-ca
or verify-full
. The former option only enforces that
the certificate is valid, while the latter also ensures that the
cn
(Common Name) in the certificate matches
the user name or an applicable mapping.
Table 19.2 summarizes the files that are relevant to the SSL setup on the server. (The shown file names are default names. The locally configured names could be different.)
Table 19.2. SSL Server File Usage
File | Contents | Effect |
---|---|---|
ssl_cert_file ($PGDATA/server.crt ) | server certificate | sent to client to indicate server's identity |
ssl_key_file ($PGDATA/server.key ) | server private key | proves server certificate was sent by the owner; does not indicate certificate owner is trustworthy |
ssl_ca_file | trusted certificate authorities | checks that client certificate is signed by a trusted certificate authority |
ssl_crl_file | certificates revoked by certificate authorities | client certificate must not be on this list |
The server reads these files at server start and whenever the server configuration is reloaded. On Windows systems, they are also re-read whenever a new backend process is spawned for a new client connection.
If an error in these files is detected at server start, the server will refuse to start. But if an error is detected during a configuration reload, the files are ignored and the old SSL configuration continues to be used. On Windows systems, if an error in these files is detected at backend start, that backend will be unable to establish an SSL connection. In all these cases, the error condition is reported in the server log.
To create a simple self-signed certificate for the server, valid for 365
days, use the following OpenSSL command,
replacing dbhost.yourdomain.com
with the
server's host name:
openssl req -new -x509 -days 365 -nodes -text -out server.crt \
  -keyout server.key -subj "/CN=dbhost.yourdomain.com"
Then do:
chmod og-rwx server.key
because the server will reject the file if its permissions are more liberal than this. For more details on how to create your server private key and certificate, refer to the OpenSSL documentation.
While a self-signed certificate can be used for testing, a certificate signed by a certificate authority (CA) (usually an enterprise-wide root CA) should be used in production.
To create a server certificate whose identity can be validated by clients, first create a certificate signing request (CSR) and a public/private key file:
openssl req -new -nodes -text -out root.csr \
  -keyout root.key -subj "/CN=root.yourdomain.com"
chmod og-rwx root.key
Then, sign the request with the key to create a root certificate authority (using the default OpenSSL configuration file location on Linux):
openssl x509 -req -in root.csr -text -days 3650 \
  -extfile /etc/ssl/openssl.cnf -extensions v3_ca \
  -signkey root.key -out root.crt
Finally, create a server certificate signed by the new root certificate authority:
openssl req -new -nodes -text -out server.csr \
  -keyout server.key -subj "/CN=dbhost.yourdomain.com"
chmod og-rwx server.key
openssl x509 -req -in server.csr -text -days 365 \
-CA root.crt -CAkey root.key -CAcreateserial \
-out server.crt
server.crt
and server.key
should be stored on the server, and root.crt
should
be stored on the client so the client can verify that the server's leaf
certificate was signed by its trusted root certificate.
root.key
should be stored offline for use in
creating future certificates.
It is also possible to create a chain of trust that includes intermediate certificates:
# root
openssl req -new -nodes -text -out root.csr \
  -keyout root.key -subj "/CN=root.yourdomain.com"
chmod og-rwx root.key
openssl x509 -req -in root.csr -text -days 3650 \
  -extfile /etc/ssl/openssl.cnf -extensions v3_ca \
  -signkey root.key -out root.crt

# intermediate
openssl req -new -nodes -text -out intermediate.csr \
  -keyout intermediate.key -subj "/CN=intermediate.yourdomain.com"
chmod og-rwx intermediate.key
openssl x509 -req -in intermediate.csr -text -days 1825 \
  -extfile /etc/ssl/openssl.cnf -extensions v3_ca \
  -CA root.crt -CAkey root.key -CAcreateserial \
  -out intermediate.crt

# leaf
openssl req -new -nodes -text -out server.csr \
  -keyout server.key -subj "/CN=dbhost.yourdomain.com"
chmod og-rwx server.key
openssl x509 -req -in server.csr -text -days 365 \
  -CA intermediate.crt -CAkey intermediate.key -CAcreateserial \
  -out server.crt
server.crt
and
intermediate.crt
should be concatenated
into a certificate file bundle and stored on the server.
server.key
should also be stored on the server.
root.crt
should be stored on the client so
the client can verify that the server's leaf certificate was signed
by a chain of certificates linked to its trusted root certificate.
root.key
and intermediate.key
should be stored offline for use in creating future certificates.
PostgreSQL also has native support for using GSSAPI to encrypt client/server communications for increased security. Support requires that a GSSAPI implementation (such as MIT Kerberos) is installed on both client and server systems, and that support in PostgreSQL is enabled at build time (see Chapter 17).
The PostgreSQL server will listen for both normal and GSSAPI-encrypted connections on the same TCP port, and will negotiate with any connecting client whether to use GSSAPI for encryption (and for authentication). By default, this decision is up to the client (which means it can be downgraded by an attacker); see Section 21.1 about setting up the server to require the use of GSSAPI for some or all connections.
When using GSSAPI for encryption, it is common to use GSSAPI for authentication as well, since the underlying mechanism will determine both client and server identities (according to the GSSAPI implementation) in any case. But this is not required; another PostgreSQL authentication method can be chosen to perform additional verification.
Other than configuration of the negotiation behavior, GSSAPI encryption requires no setup beyond that which is necessary for GSSAPI authentication. (For more information on configuring that, see Section 21.6.)
It is possible to use SSH to encrypt the network connection between clients and a PostgreSQL server. Done properly, this provides an adequately secure network connection, even for non-SSL-capable clients.
First make sure that an SSH server is
running properly on the same machine as the
PostgreSQL server and that you can log in using
ssh
as some user; you then can establish a
secure tunnel to the remote server. A secure tunnel listens on a
local port and forwards all traffic to a port on the remote machine.
Traffic sent to the remote port can arrive on its
localhost
address, or different bind
address if desired; it does not appear as coming from your
local machine. This command creates a secure tunnel from the client
machine to the remote machine foo.com
:
ssh -L 63333:localhost:5432 joe@foo.com
The first number in the -L
argument, 63333, is the
local port number of the tunnel; it can be any unused port. (IANA
reserves ports 49152 through 65535 for private use.) The name or IP
address after this is the remote bind address you are connecting to,
i.e., localhost
, which is the default. The second
number, 5432, is the remote end of the tunnel, e.g., the port number
your database server is using. In order to connect to the database
server using this tunnel, you connect to port 63333 on the local
machine:
psql -h localhost -p 63333 postgres
To the database server it will then look as though you are
user joe
on host foo.com
connecting to the localhost
bind address, and it
will use whatever authentication procedure was configured for
connections by that user to that bind address. Note that the server will not
think the connection is SSL-encrypted, since in fact it is not
encrypted between the
SSH server and the
PostgreSQL server. This should not pose any
extra security risk because they are on the same machine.
In order for the
tunnel setup to succeed you must be allowed to connect via
ssh
as joe@foo.com
, just
as if you had attempted to use ssh
to create a
terminal session.
You could also have set up port forwarding as
ssh -L 63333:foo.com:5432 joe@foo.com
but then the database server will see the connection as coming in
on its foo.com
bind address, which is not opened by
the default setting listen_addresses =
'localhost'
. This is usually not what you want.
If you have to “hop” to the database server via some login host, one possible setup could look like this:
ssh -L 63333:db.foo.com:5432 joe@shell.foo.com
Note that this way the connection
from shell.foo.com
to db.foo.com
will not be encrypted by the SSH
tunnel.
SSH offers quite a few configuration possibilities when the network
is restricted in various ways. Please refer to the SSH
documentation for details.
Several other applications exist that can provide secure tunnels using a procedure similar in concept to the one just described.
To register a Windows event log library with the operating system, issue this command:
regsvr32 pgsql_library_directory/pgevent.dll
This creates registry entries used by the event viewer, under the default
event source named PostgreSQL
.
To specify a different event source name (see
event_source), use the /n
and /i
options:
regsvr32 /n /i:event_source_name pgsql_library_directory/pgevent.dll
To unregister the event log library from the operating system, issue this command:
regsvr32 /u [/i:event_source_name] pgsql_library_directory/pgevent.dll
To enable event logging in the database server, modify
log_destination to include
eventlog
in postgresql.conf
.
Table of Contents
There are many configuration parameters that affect the behavior of the database system. In the first section of this chapter we describe how to interact with configuration parameters. The subsequent sections discuss each parameter in detail.
All parameter names are case-insensitive. Every parameter takes a value of one of five types: boolean, string, integer, floating point, or enumerated (enum). The type determines the syntax for setting the parameter:
Boolean: Values can be written as on, off, true, false, yes, no, 1, 0 (all case-insensitive) or any unambiguous prefix of one of these.
String: In general, enclose the value in single quotes, doubling any single quotes within the value. Quotes can usually be omitted if the value is a simple number or identifier, however. (Values that match an SQL keyword require quoting in some contexts.)
Numeric (integer and floating point):
Numeric parameters can be specified in the customary integer and
floating-point formats; fractional values are rounded to the nearest
integer if the parameter is of integer type. Integer parameters
additionally accept hexadecimal input (beginning
with 0x
) and octal input (beginning
with 0
), but these formats cannot have a fraction.
Do not use thousands separators.
Quotes are not required, except for hexadecimal input.
Numeric with Unit:
Some numeric parameters have an implicit unit, because they describe
quantities of memory or time. The unit might be bytes, kilobytes, blocks
(typically eight kilobytes), milliseconds, seconds, or minutes.
An unadorned numeric value for one of these settings will use the
setting's default unit, which can be learned from
pg_settings
.unit
.
For convenience, settings can be given with a unit specified explicitly,
for example '120 ms'
for a time value, and they will be
converted to whatever the parameter's actual unit is. Note that the
value must be written as a string (with quotes) to use this feature.
The unit name is case-sensitive, and there can be whitespace between
the numeric value and the unit.
Valid memory units are B (bytes), kB (kilobytes), MB (megabytes), GB (gigabytes), and TB (terabytes). The multiplier for memory units is 1024, not 1000.
Valid time units are us (microseconds), ms (milliseconds), s (seconds), min (minutes), h (hours), and d (days).
If a fractional value is specified with a unit, it will be rounded
to a multiple of the next smaller unit if there is one.
For example, 30.1 GB
will be converted
to 30822 MB
not 32319628902 B
.
If the parameter is of integer type, a final rounding to integer
occurs after any unit conversion.
Enumerated:
Enumerated-type parameters are written in the same way as string
parameters, but are restricted to have one of a limited set of
values. The values allowable for such a parameter can be found from
pg_settings
.enumvals
.
Enum parameter values are case-insensitive.
The most fundamental way to set these parameters is to edit the file
postgresql.conf
,
which is normally kept in the data directory. A default copy is
installed when the database cluster directory is initialized.
An example of what this file might look like is:
# This is a comment
log_connections = yes
log_destination = 'syslog'
search_path = '"$user", public'
shared_buffers = 128MB
One parameter is specified per line. The equal sign between name and
value is optional. Whitespace is insignificant (except within a quoted
parameter value) and blank lines are
ignored. Hash marks (#
) designate the remainder
of the line as a comment. Parameter values that are not simple
identifiers or numbers must be single-quoted. To embed a single
quote in a parameter value, write either two quotes (preferred)
or backslash-quote.
If the file contains multiple entries for the same parameter,
all but the last one are ignored.
Parameters set in this way provide default values for the cluster. The settings seen by active sessions will be these values unless they are overridden. The following sections describe ways in which the administrator or user can override these defaults.
The configuration file is reread whenever the main server process
receives a SIGHUP signal; this signal is most easily
sent by running pg_ctl reload
from the command line or by
calling the SQL function pg_reload_conf()
. The main
server process also propagates this signal to all currently running
server processes, so that existing sessions also adopt the new values
(this will happen after they complete any currently-executing client
command). Alternatively, you can
send the signal to a single server process directly. Some parameters
can only be set at server start; any changes to their entries in the
configuration file will be ignored until the server is restarted.
Invalid parameter settings in the configuration file are likewise
ignored (but logged) during SIGHUP processing.
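For example, either of the following causes the configuration files to be
reread (depending on your installation, pg_ctl may also need a -D option or a
suitable environment):
pg_ctl reload
SELECT pg_reload_conf();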
In addition to postgresql.conf
,
a PostgreSQL data directory contains a file
postgresql.auto.conf
,
which has the same format as postgresql.conf
but
is intended to be edited automatically, not manually. This file holds
settings provided through the ALTER SYSTEM
command.
This file is read whenever postgresql.conf
is,
and its settings take effect in the same way. Settings
in postgresql.auto.conf
override those
in postgresql.conf
.
External tools may also
modify postgresql.auto.conf
. It is not
recommended to do this while the server is running, since a
concurrent ALTER SYSTEM
command could overwrite
such changes. Such tools might simply append new settings to the end,
or they might choose to remove duplicate settings and/or comments
(as ALTER SYSTEM
will).
The system view
pg_file_settings
can be helpful for pre-testing changes to the configuration files, or for
diagnosing problems if a SIGHUP signal did not have the
desired effects.
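For example, after editing the files you might check for entries that could
not be applied; the filter shown here is only one possibility:
SELECT name, setting, applied, error
FROM pg_file_settings
WHERE NOT applied OR error IS NOT NULL;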
PostgreSQL provides three SQL
commands to establish configuration defaults.
The already-mentioned ALTER SYSTEM
command
provides an SQL-accessible means of changing global defaults; it is
functionally equivalent to editing postgresql.conf
.
In addition, there are two commands that allow setting of defaults
on a per-database or per-role basis:
The ALTER DATABASE
command allows global
settings to be overridden on a per-database basis.
The ALTER ROLE
command allows both global and
per-database settings to be overridden with user-specific values.
Values set with ALTER DATABASE
and ALTER ROLE
are applied only when starting a fresh database session. They
override values obtained from the configuration files or server
command line, and constitute defaults for the rest of the session.
Note that some settings cannot be changed after server start, and
so cannot be set with these commands (or the ones listed below).
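As an illustration (the database and role names are hypothetical):
ALTER SYSTEM SET log_min_duration_statement = '250ms';
ALTER DATABASE mydb SET statement_timeout = '1min';
ALTER ROLE reporting IN DATABASE mydb SET work_mem = '32MB';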
Once a client is connected to the database, PostgreSQL provides two additional SQL commands (and equivalent functions) to interact with session-local configuration settings:
The SHOW
command allows inspection of the
current value of any parameter. The corresponding SQL function is
current_setting(setting_name text)
(see Section 9.27.1).
The SET
command allows modification of the
current value of those parameters that can be set locally to a
session; it has no effect on other sessions.
The corresponding SQL function is
set_config(setting_name, new_value, is_local)
(see Section 9.27.1).
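For example (the schema name is hypothetical):
SHOW search_path;
SET search_path TO myschema, public;
SELECT current_setting('search_path');
SELECT set_config('search_path', 'myschema, public', false);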
In addition, the system view pg_settings
can be
used to view and change session-local values:
Querying this view is similar to using SHOW ALL
but
provides more detail. It is also more flexible, since it's possible
to specify filter conditions or join against other relations.
Using UPDATE
on this view, specifically
updating the setting
column, is the equivalent
of issuing SET
commands. For example, the equivalent of
SET configuration_parameter TO DEFAULT;
is:
UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter';
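The view can also be queried with arbitrary filters; for instance, to list
parameters whose values do not come from built-in defaults:
SELECT name, setting, unit, source FROM pg_settings WHERE source <> 'default';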
In addition to setting global defaults or attaching overrides at the database or role level, you can pass settings to PostgreSQL via shell facilities. Both the server and libpq client library accept parameter values via the shell.
During server startup, parameter settings can be
passed to the postgres
command via the
-c
command-line parameter. For example,
postgres -c log_connections=yes -c log_destination='syslog'
Settings provided in this way override those set via
postgresql.conf
or ALTER SYSTEM
,
so they cannot be changed globally without restarting the server.
When starting a client session via libpq,
parameter settings can be
specified using the PGOPTIONS
environment variable.
Settings established in this way constitute defaults for the life
of the session, but do not affect other sessions.
For historical reasons, the format of PGOPTIONS
is
similar to that used when launching the postgres
command; specifically, the -c
flag must be specified.
For example,
env PGOPTIONS="-c geqo=off -c statement_timeout=5min" psql
Other clients and libraries might provide their own mechanisms, via the shell or otherwise, that allow the user to alter session settings without direct use of SQL commands.
PostgreSQL provides several features for breaking
down complex postgresql.conf
files into sub-files.
These features are especially useful when managing multiple servers
with related, but not identical, configurations.
In addition to individual parameter settings,
the postgresql.conf
file can contain include
directives, which specify another file to read and process as if
it were inserted into the configuration file at this point. This
feature allows a configuration file to be divided into physically
separate parts. Include directives simply look like:
include 'filename'
If the file name is not an absolute path, it is taken as relative to the directory containing the referencing configuration file. Inclusions can be nested.
There is also an include_if_exists
directive, which acts
the same as the include
directive, except
when the referenced file does not exist or cannot be read. A regular
include
will consider this an error condition, but
include_if_exists
merely logs a message and continues
processing the referencing configuration file.
The postgresql.conf
file can also contain
include_dir
directives, which specify an entire
directory of configuration files to include. These look like
include_dir 'directory'
Non-absolute directory names are taken as relative to the directory
containing the referencing configuration file. Within the specified
directory, only non-directory files whose names end with the
suffix .conf
will be included. File names that
start with the .
character are also ignored, to
prevent mistakes since such files are hidden on some platforms. Multiple
files within an include directory are processed in file name order
(according to C locale rules, i.e., numbers before letters, and
uppercase letters before lowercase ones).
Include files or directories can be used to logically separate portions
of the database configuration, rather than having a single large
postgresql.conf
file. Consider a company that has two
database servers, each with a different amount of memory. There are
likely elements of the configuration both will share, for things such
as logging. But memory-related parameters on the server will vary
between the two. And there might be server-specific customizations,
too. One way to manage this situation is to break the custom
configuration changes for your site into three files. You could add
this to the end of your postgresql.conf
file to include
them:
include 'shared.conf'
include 'memory.conf'
include 'server.conf'
All systems would have the same shared.conf
. Each
server with a particular amount of memory could share the
same memory.conf
; you might have one for all servers
with 8GB of RAM, another for those having 16GB. And
finally server.conf
could have truly server-specific
configuration information in it.
Another possibility is to create a configuration file directory and
put this information into files there. For example, a conf.d
directory could be referenced at the end of postgresql.conf
:
include_dir 'conf.d'
Then you could name the files in the conf.d
directory
like this:
00shared.conf
01memory.conf
02server.conf
This naming convention establishes a clear order in which these
files will be loaded. This is important because only the last
setting encountered for a particular parameter while the server is
reading configuration files will be used. In this example,
something set in conf.d/02server.conf
would override a
value set in conf.d/01memory.conf
.
You might instead use this approach to naming the files descriptively:
00shared.conf
01memory-8GB.conf
02server-foo.conf
This sort of arrangement gives a unique name for each configuration file variation. This can help eliminate ambiguity when several servers have their configurations all stored in one place, such as in a version control repository. (Storing database configuration files under version control is another good practice to consider.)
In addition to the postgresql.conf
file
already mentioned, PostgreSQL uses
two other manually-edited configuration files, which control
client authentication (their use is discussed in Chapter 21). By default, all three
configuration files are stored in the database cluster's data
directory. The parameters described in this section allow the
configuration files to be placed elsewhere. (Doing so can ease
administration. In particular it is often easier to ensure that
the configuration files are properly backed-up when they are
kept separate.)
data_directory
(string
)
Specifies the directory to use for data storage. This parameter can only be set at server start.
config_file
(string
)
Specifies the main server configuration file
(customarily called postgresql.conf
).
This parameter can only be set on the postgres
command line.
hba_file
(string
)
Specifies the configuration file for host-based authentication
(customarily called pg_hba.conf
).
This parameter can only be set at server start.
ident_file
(string
)
Specifies the configuration file for user name mapping
(customarily called pg_ident.conf
).
This parameter can only be set at server start.
See also Section 21.2.
external_pid_file
(string
)
Specifies the name of an additional process-ID (PID) file that the server should create for use by server administration programs. This parameter can only be set at server start.
In a default installation, none of the above parameters are set
explicitly. Instead, the
data directory is specified by the -D
command-line
option or the PGDATA
environment variable, and the
configuration files are all found within the data directory.
If you wish to keep the configuration files elsewhere than the
data directory, the postgres
-D
command-line option or PGDATA
environment variable
must point to the directory containing the configuration files,
and the data_directory
parameter must be set in
postgresql.conf
(or on the command line) to show
where the data directory is actually located. Notice that
data_directory
overrides -D
and
PGDATA
for the location
of the data directory, but not for the location of the configuration
files.
If you wish, you can specify the configuration file names and locations
individually using the parameters config_file
,
hba_file
and/or ident_file
.
config_file
can only be specified on the
postgres
command line, but the others can be
set within the main configuration file. If all three parameters plus
data_directory
are explicitly set, then it is not necessary
to specify -D
or PGDATA
.
When setting any of these parameters, a relative path will be interpreted
with respect to the directory in which postgres
is started.
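As an illustration (the paths are hypothetical), the configuration files could
be kept under /etc/postgresql while the data lives elsewhere:
postgres -D /etc/postgresql
with /etc/postgresql/postgresql.conf containing:
data_directory = '/var/lib/postgresql/data'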
listen_addresses
(string
)
Specifies the TCP/IP address(es) on which the server is
to listen for connections from client applications.
The value takes the form of a comma-separated list of host names
and/or numeric IP addresses. The special entry *
corresponds to all available IP interfaces. The entry
0.0.0.0
allows listening for all IPv4 addresses and
::
allows listening for all IPv6 addresses.
If the list is empty, the server does not listen on any IP interface
at all, in which case only Unix-domain sockets can be used to connect
to it. If the list is not empty, the server will start if it
can listen on at least one TCP/IP address. A warning will be
emitted for any TCP/IP address which cannot be opened.
The default value is localhost,
which allows only local TCP/IP “loopback” connections to be
made.
While client authentication (Chapter 21) allows fine-grained control
over who can access the server, listen_addresses
controls which interfaces accept connection attempts, which
can help prevent repeated malicious connection requests on
insecure network interfaces. This parameter can only be set
at server start.
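For example, to accept connections on the loopback interface plus one specific
address (the address shown is only an example):
listen_addresses = 'localhost, 192.168.12.10'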
port
(integer
)
The TCP port the server listens on; 5432 by default. Note that the same port number is used for all IP addresses the server listens on. This parameter can only be set at server start.
max_connections
(integer
)
Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections, but might be less if your kernel settings will not support it (as determined during initdb). This parameter can only be set at server start.
When running a standby server, you must set this parameter to the same or higher value than on the primary server. Otherwise, queries will not be allowed in the standby server.
superuser_reserved_connections
(integer
)
Determines the number of connection “slots” that
are reserved for connections by PostgreSQL
superusers. At most max_connections
connections can ever be active simultaneously. Whenever the
number of active concurrent connections is at least
max_connections
minus
superuser_reserved_connections
, new
connections will be accepted only for superusers, and no
new replication connections will be accepted.
The default value is three connections. The value must be less
than max_connections
.
This parameter can only be set at server start.
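For example, with max_connections = 100 and
superuser_reserved_connections = 3, ordinary (non-superuser) sessions can
occupy at most 97 connection slots; once 97 connections are active, only
superusers can still connect.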
unix_socket_directories
(string
)
Specifies the directory of the Unix-domain socket(s) on which the server is to listen for connections from client applications. Multiple sockets can be created by listing multiple directories separated by commas. Whitespace between entries is ignored; surround a directory name with double quotes if you need to include whitespace or commas in the name. An empty value specifies not listening on any Unix-domain sockets, in which case only TCP/IP sockets can be used to connect to the server.
A value that starts with @
specifies that a
Unix-domain socket in the abstract namespace should be created
(currently supported on Linux only). In that case, this value
does not specify a “directory” but a prefix from which
the actual socket name is computed in the same manner as for the
file-system namespace. While the abstract socket name prefix can be
chosen freely, since it is not a file-system location, the convention
is to nonetheless use file-system-like values such as
@/tmp
.
The default value is normally
/tmp
, but that can be changed at build time.
On Windows, the default is empty, which means no Unix-domain socket is
created by default.
This parameter can only be set at server start.
In addition to the socket file itself, which is named
.s.PGSQL.nnnn, where nnnn is the server's port number, an ordinary file
named .s.PGSQL.nnnn.lock will be created in each of the
unix_socket_directories directories.
Neither file should ever be removed manually.
For sockets in the abstract namespace, no lock file is created.
unix_socket_group
(string
)
Sets the owning group of the Unix-domain socket(s). (The owning
user of the sockets is always the user that starts the
server.) In combination with the parameter
unix_socket_permissions
this can be used as
an additional access control mechanism for Unix-domain connections.
By default this is the empty string, which uses the default
group of the server user. This parameter can only be set at
server start.
This parameter is not supported on Windows. Any setting will be ignored. Also, sockets in the abstract namespace have no file owner, so this setting is also ignored in that case.
unix_socket_permissions
(integer
)
Sets the access permissions of the Unix-domain socket(s). Unix-domain
sockets use the usual Unix file system permission set.
The parameter value is expected to be a numeric mode
specified in the format accepted by the
chmod
and umask
system calls. (To use the customary octal format the number
must start with a 0
(zero).)
The default permissions are 0777
, meaning
anyone can connect. Reasonable alternatives are
0770
(only user and group, see also
unix_socket_group
) and 0700
(only user). (Note that for a Unix-domain socket, only write
permission matters, so there is no point in setting or revoking
read or execute permissions.)
This access control mechanism is independent of the one described in Chapter 21.
This parameter can only be set at server start.
This parameter is irrelevant on systems, notably Solaris as of Solaris
10, that ignore socket permissions entirely. There, one can achieve a
similar effect by pointing unix_socket_directories
to a
directory having search permission limited to the desired audience.
Sockets in the abstract namespace have no file permissions, so this setting is also ignored in that case.
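For example, to restrict Unix-domain connections to members of a single
operating-system group (the directory and group name are illustrative):
unix_socket_directories = '/var/run/postgresql'
unix_socket_group = 'pgusers'
unix_socket_permissions = 0770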
bonjour
(boolean
)
Enables advertising the server's existence via Bonjour. The default is off. This parameter can only be set at server start.
bonjour_name
(string
)
Specifies the Bonjour service
name. The computer name is used if this parameter is set to the
empty string ''
(which is the default). This parameter is
ignored if the server was not compiled with
Bonjour support.
This parameter can only be set at server start.
tcp_keepalives_idle
(integer
)
Specifies the amount of time with no network activity after which
the operating system should send a TCP keepalive message to the client.
If this value is specified without units, it is taken as seconds.
A value of 0 (the default) selects the operating system's default.
This parameter is supported only on systems that support
TCP_KEEPIDLE
or an equivalent socket option, and on
Windows; on other systems, it must be zero.
In sessions connected via a Unix-domain socket, this parameter is
ignored and always reads as zero.
On Windows, setting a value of 0 will set this parameter to 2 hours, since Windows does not provide a way to read the system default value.
tcp_keepalives_interval
(integer
)
Specifies the amount of time after which a TCP keepalive message
that has not been acknowledged by the client should be retransmitted.
If this value is specified without units, it is taken as seconds.
A value of 0 (the default) selects the operating system's default.
This parameter is supported only on systems that support
TCP_KEEPINTVL
or an equivalent socket option, and on
Windows; on other systems, it must be zero.
In sessions connected via a Unix-domain socket, this parameter is
ignored and always reads as zero.
On Windows, setting a value of 0 will set this parameter to 1 second, since Windows does not provide a way to read the system default value.
tcp_keepalives_count
(integer
)
Specifies the number of TCP keepalive messages that can be lost before
the server's connection to the client is considered dead.
A value of 0 (the default) selects the operating system's default.
This parameter is supported only on systems that support
TCP_KEEPCNT
or an equivalent socket option;
on other systems, it must be zero.
In sessions connected via a Unix-domain socket, this parameter is
ignored and always reads as zero.
This parameter is not supported on Windows, and must be zero.
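An illustrative combination (the values are examples only) that declares an
unresponsive connection dead roughly two minutes after activity ceases
(60 seconds idle plus 6 probes at 10-second intervals):
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6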
tcp_user_timeout
(integer
)
Specifies the amount of time that transmitted data may
remain unacknowledged before the TCP connection is forcibly closed.
If this value is specified without units, it is taken as milliseconds.
A value of 0 (the default) selects the operating system's default.
This parameter is supported only on systems that support
TCP_USER_TIMEOUT
; on other systems, it must be zero.
In sessions connected via a Unix-domain socket, this parameter is
ignored and always reads as zero.
This parameter is not supported on Windows, and must be zero.
client_connection_check_interval
(integer
)
Sets the time interval between optional checks that the client is still connected, while running queries. The check is performed by polling the socket, and allows long running queries to be aborted sooner if the kernel reports that the connection is closed.
This option is currently available only on systems that support the
non-standard POLLRDHUP
extension to the
poll
system call, including Linux.
If the value is specified without units, it is taken as milliseconds.
The default value is 0
, which disables connection
checks. Without connection checks, the server will detect the loss of
the connection only at the next interaction with the socket, when it
waits for, receives or sends data.
For the kernel itself to detect lost TCP connections reliably and within a known timeframe in all scenarios including network failure, it may also be necessary to adjust the TCP keepalive settings of the operating system, or the tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count settings of PostgreSQL.
authentication_timeout
(integer
)
Maximum amount of time allowed to complete client authentication. If a
would-be client has not completed the authentication protocol in
this much time, the server closes the connection. This prevents
hung clients from occupying a connection indefinitely.
If this value is specified without units, it is taken as seconds.
The default is one minute (1m
).
This parameter can only be set in the postgresql.conf
file or on the server command line.
password_encryption
(enum
)
When a password is specified in CREATE ROLE or
ALTER ROLE, this parameter determines the
algorithm to use to encrypt the password. Possible values are
scram-sha-256
, which will encrypt the password with
SCRAM-SHA-256, and md5
, which stores the password
as an MD5 hash. The default is scram-sha-256
.
Note that older clients might lack support for the SCRAM authentication mechanism, and hence not work with passwords encrypted with SCRAM-SHA-256. See Section 21.5 for more details.
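For example (the role name and password are placeholders), the following
stores a SCRAM-SHA-256 verifier rather than an MD5 hash:
SET password_encryption = 'scram-sha-256';
CREATE ROLE app_user LOGIN PASSWORD 'changeme';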
krb_server_keyfile
(string
)
Sets the location of the server's Kerberos key file. The default is
FILE:/usr/local/pgsql/etc/krb5.keytab
(where the directory part is whatever was specified
as sysconfdir
at build time; use
pg_config --sysconfdir
to determine that).
If this parameter is set to an empty string, it is ignored and a
system-dependent default is used.
This parameter can only be set in the
postgresql.conf
file or on the server command line.
See Section 21.6 for more information.
krb_caseins_users
(boolean
)
Sets whether GSSAPI user names should be treated
case-insensitively.
The default is off
(case sensitive). This parameter can only be
set in the postgresql.conf
file or on the server command line.
db_user_namespace
(boolean
)
This parameter enables per-database user names. It is off by default.
This parameter can only be set in the postgresql.conf
file or on the server command line.
If this is on, you should create users as username@dbname
.
When username
is passed by a connecting client,
@
and the database name are appended to the user
name and that database-specific user name is looked up by the
server. Note that when you create users with names containing
@
within the SQL environment, you will need to
quote the user name.
With this parameter enabled, you can still create ordinary global
users. Simply append @
when specifying the user
name in the client, e.g., joe@
. The @
will be stripped off before the user name is looked up by the
server.
db_user_namespace
causes the client's and
server's user name representation to differ.
Authentication checks are always done with the server's user name
so authentication methods must be configured for the
server's user name, not the client's. Because
md5
uses the user name as salt on both the
client and server, md5
cannot be used with
db_user_namespace
.
This feature is intended as a temporary measure until a complete solution is found. At that time, this option will be removed.
See Section 19.9 for more information about setting up SSL.
ssl
(boolean
)
Enables SSL connections.
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is off
.
ssl_ca_file
(string
)
Specifies the name of the file containing the SSL server certificate
authority (CA).
Relative paths are relative to the data directory.
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is empty, meaning no CA file is loaded,
and client certificate verification is not performed.
ssl_cert_file
(string
)
Specifies the name of the file containing the SSL server certificate.
Relative paths are relative to the data directory.
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is server.crt
.
ssl_crl_file
(string
)
Specifies the name of the file containing the SSL client certificate
revocation list (CRL).
Relative paths are relative to the data directory.
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is empty, meaning no CRL file is loaded (unless
ssl_crl_dir is set).
ssl_crl_dir
(string
)
Specifies the name of the directory containing the SSL client
certificate revocation list (CRL). Relative paths are relative to the
data directory. This parameter can only be set in
the postgresql.conf
file or on the server command
line. The default is empty, meaning no CRLs are used (unless
ssl_crl_file is set).
The directory needs to be prepared with the
OpenSSL command
openssl rehash
or c_rehash
. See
its documentation for details.
When using this setting, CRLs in the specified directory are loaded on-demand at connection time. New CRLs can be added to the directory and will be used immediately. This is unlike ssl_crl_file, which causes the CRL in the file to be loaded at server start time or when the configuration is reloaded. Both settings can be used together.
ssl_key_file
(string
)
Specifies the name of the file containing the SSL server private key.
Relative paths are relative to the data directory.
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is server.key
.
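A minimal server-side SSL setup might look like this (the file names are
conventional examples; ssl_ca_file is needed only if client certificates are
to be verified):
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'
ssl_ca_file = 'root.crt'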
ssl_ciphers
(string
)
Specifies a list of SSL cipher suites that are
allowed to be used by SSL connections. See the
ciphers
manual page in the OpenSSL package for the
syntax of this setting and a list of supported values. Only
connections using TLS version 1.2 and lower are affected. There is
currently no setting that controls the cipher choices used by TLS
version 1.3 connections. The default value is
HIGH:MEDIUM:+3DES:!aNULL
. The default is usually a
reasonable choice unless you have specific security requirements.
This parameter can only be set in the
postgresql.conf
file or on the server command
line.
Explanation of the default value:
HIGH
Cipher suites that use ciphers from HIGH
group (e.g.,
AES, Camellia, 3DES)
MEDIUM
Cipher suites that use ciphers from MEDIUM
group
(e.g., RC4, SEED)
+3DES
The OpenSSL default order for
HIGH
is problematic because it orders 3DES
higher than AES128. This is wrong because 3DES offers less
security than AES128, and it is also much slower.
+3DES
reorders it after all other
HIGH
and MEDIUM
ciphers.
!aNULL
Disables anonymous cipher suites that do no authentication. Such cipher suites are vulnerable to MITM attacks and therefore should not be used.
Available cipher suite details will vary across
OpenSSL versions. Use the command
openssl ciphers -v 'HIGH:MEDIUM:+3DES:!aNULL'
to
see actual details for the currently installed
OpenSSL version. Note that this list is
filtered at run time based on the server key type.
ssl_prefer_server_ciphers
(boolean
)
Specifies whether to use the server's SSL cipher preferences, rather
than the client's.
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is on
.
PostgreSQL versions before 9.4 do not have this setting and always use the client's preferences. This setting is mainly for backward compatibility with those versions. Using the server's preferences is usually better because it is more likely that the server is appropriately configured.
ssl_ecdh_curve
(string
)
Specifies the name of the curve to use in ECDH key
exchange. It needs to be supported by all clients that connect.
It does not need to be the same curve used by the server's Elliptic
Curve key.
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is prime256v1
.
OpenSSL names for the most common curves
are:
prime256v1
(NIST P-256),
secp384r1
(NIST P-384),
secp521r1
(NIST P-521).
The full list of available curves can be shown with the command
openssl ecparam -list_curves
. Not all of them
are usable in TLS though.
ssl_min_protocol_version
(enum
)
Sets the minimum SSL/TLS protocol version to use. Valid values are
currently: TLSv1
, TLSv1.1
,
TLSv1.2
, TLSv1.3
. Older
versions of the OpenSSL library do not
support all values; an error will be raised if an unsupported setting
is chosen. Protocol versions before TLS 1.0, namely SSL version 2 and
3, are always disabled.
The default is TLSv1.2
, which satisfies industry
best practices as of this writing.
This parameter can only be set in the postgresql.conf
file or on the server command line.
ssl_max_protocol_version
(enum
)
Sets the maximum SSL/TLS protocol version to use. Valid values are as for ssl_min_protocol_version, with addition of an empty string, which allows any protocol version. The default is to allow any version. Setting the maximum protocol version is mainly useful for testing or if some component has issues working with a newer protocol.
This parameter can only be set in the postgresql.conf
file or on the server command line.
ssl_dh_params_file
(string
)
Specifies the name of the file containing Diffie-Hellman parameters
used for the so-called ephemeral DH family of SSL ciphers. The default is
empty, in which case the compiled-in default DH parameters are used. Using
custom DH parameters reduces the exposure if an attacker manages to
crack the well-known compiled-in DH parameters. You can create your own
DH parameters file with the command
openssl dhparam -out dhparams.pem 2048
.
This parameter can only be set in the postgresql.conf
file or on the server command line.
ssl_passphrase_command
(string
)
Sets an external command to be invoked when a passphrase for decrypting an SSL file such as a private key needs to be obtained. By default, this parameter is empty, which means the built-in prompting mechanism is used.
The command must print the passphrase to the standard output and exit
with code 0. In the parameter value, %p
is
replaced by a prompt string. (Write %%
for a
literal %
.) Note that the prompt string will
probably contain whitespace, so be sure to quote adequately. A single
newline is stripped from the end of the output if present.
The command does not actually have to prompt the user for a passphrase. It can read it from a file, obtain it from a keychain facility, or similar. It is up to the user to make sure the chosen mechanism is adequately secure.
This parameter can only be set in the postgresql.conf
file or on the server command line.
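For example, a sketch that reads the passphrase from a root-only file (the
path is hypothetical):
ssl_passphrase_command = 'cat /etc/postgresql/server.key.pass'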
ssl_passphrase_command_supports_reload
(boolean
)
This parameter determines whether the passphrase command set by
ssl_passphrase_command
will also be called during a
configuration reload if a key file needs a passphrase. If this
parameter is off (the default), then
ssl_passphrase_command
will be ignored during a
reload and the SSL configuration will not be reloaded if a passphrase
is needed. That setting is appropriate for a command that requires a
TTY for prompting, which might not be available when the server is
running. Setting this parameter to on might be appropriate if the
passphrase is obtained from a file, for example.
This parameter can only be set in the postgresql.conf
file or on the server command line.
shared_buffers
(integer
)
Sets the amount of memory the database server uses for shared
memory buffers. The default is typically 128 megabytes
(128MB
), but might be less if your kernel settings will
not support it (as determined during initdb).
This setting must be at least 128 kilobytes. However,
settings significantly higher than the minimum are usually needed
for good performance.
If this value is specified without units, it is taken as blocks,
that is BLCKSZ
bytes, typically 8kB.
(Non-default values of BLCKSZ
change the minimum
value.)
This parameter can only be set at server start.
If you have a dedicated database server with 1GB or more of RAM, a
reasonable starting value for shared_buffers
is 25%
of the memory in your system. There are some workloads where even
larger settings for shared_buffers
are effective, but
because PostgreSQL also relies on the
operating system cache, it is unlikely that an allocation of more than
40% of RAM to shared_buffers
will work better than a
smaller amount. Larger settings for shared_buffers
usually require a corresponding increase in
max_wal_size
, in order to spread out the
process of writing large quantities of new or changed data over a
longer period of time.
On systems with less than 1GB of RAM, a smaller percentage of RAM is appropriate, so as to leave adequate space for the operating system.
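As a starting point (the figures are illustrative), a dedicated server with
16GB of RAM might use:
shared_buffers = 4GB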
huge_pages
(enum
)
Controls whether huge pages are requested for the main shared memory
area. Valid values are try
(the default),
on
, and off
. With
huge_pages
set to try
, the
server will try to request huge pages, but fall back to the default if
that fails. With on
, failure to request huge pages
will prevent the server from starting up. With off
,
huge pages will not be requested.
At present, this setting is supported only on Linux and Windows. The
setting is ignored on other systems when set to
try
. On Linux, it is only supported when
shared_memory_type
is set to mmap
(the default).
The use of huge pages results in smaller page tables and less CPU time spent on memory management, increasing performance. For more details about using huge pages on Linux, see Section 19.4.5.
Huge pages are known as large pages on Windows. To use them, you need to assign the user right “Lock pages in memory” to the Windows user account that runs PostgreSQL. You can use Windows Group Policy tool (gpedit.msc) to assign the user right “Lock pages in memory”. To start the database server on the command prompt as a standalone process, not as a Windows service, the command prompt must be run as an administrator or User Access Control (UAC) must be disabled. When the UAC is enabled, the normal command prompt revokes the user right “Lock pages in memory” when started.
Note that this setting only affects the main shared memory area.
Operating systems such as Linux, FreeBSD, and Illumos can also use
huge pages (also known as “super” pages or
“large” pages) automatically for normal memory
allocation, without an explicit request from
PostgreSQL. On Linux, this is called
“transparent huge pages” (THP). That feature has been known to
cause performance degradation with
PostgreSQL for some users on some Linux
versions, so its use is currently discouraged (unlike explicit use of
huge_pages
).
huge_page_size
(integer
)
Controls the size of huge pages, when they are enabled with
huge_pages.
The default is zero (0
).
When set to 0
, the default huge page size on the
system will be used. This parameter can only be set at server start.
Some commonly available page sizes on modern 64 bit server architectures include:
2MB
and 1GB
(Intel and AMD), 16MB
and
16GB
(IBM POWER), and 64kB
, 2MB
,
32MB
and 1GB
(ARM). For more information
about usage and support, see Section 19.4.5.
Non-default settings are currently supported only on Linux.
temp_buffers
(integer
)
Sets the maximum amount of memory used for temporary buffers within
each database session. These are session-local buffers used only
for access to temporary tables.
If this value is specified without units, it is taken as blocks,
that is BLCKSZ
bytes, typically 8kB.
The default is eight megabytes (8MB
).
(If BLCKSZ
is not 8kB, the default value scales
proportionally to it.)
This setting can be changed within individual
sessions, but only before the first use of temporary tables
within the session; subsequent attempts to change the value will
have no effect on that session.
A session will allocate temporary buffers as needed up to the limit
given by temp_buffers
. The cost of setting a large
value in sessions that do not actually need many temporary
buffers is only a buffer descriptor, or about 64 bytes, per
increment in temp_buffers
. However if a buffer is
actually used an additional 8192 bytes will be consumed for it
(or in general, BLCKSZ
bytes).
max_prepared_transactions
(integer
)
Sets the maximum number of transactions that can be in the “prepared” state simultaneously (see PREPARE TRANSACTION). Setting this parameter to zero (which is the default) disables the prepared-transaction feature. This parameter can only be set at server start.
If you are not planning to use prepared transactions, this parameter
should be set to zero to prevent accidental creation of prepared
transactions. If you are using prepared transactions, you will
probably want max_prepared_transactions
to be at
least as large as max_connections, so that every
session can have a prepared transaction pending.
When running a standby server, you must set this parameter to the same or higher value than on the primary server. Otherwise, queries will not be allowed in the standby server.
work_mem
(integer
)
Sets the base maximum amount of memory to be used by a query operation
(such as a sort or hash table) before writing to temporary disk files.
If this value is specified without units, it is taken as kilobytes.
The default value is four megabytes (4MB
).
Note that a complex query might perform several sort and hash
operations at the same time, with each operation generally being
allowed to use as much memory as this value specifies before
it starts
to write data into temporary files. Also, several running
sessions could be doing such operations concurrently.
Therefore, the total memory used could be many times the value
of work_mem
; it is necessary to keep this
fact in mind when choosing the value. Sort operations are used
for ORDER BY
, DISTINCT
,
and merge joins.
Hash tables are used in hash joins, hash-based aggregation, memoize
nodes and hash-based processing of IN
subqueries.
Hash-based operations are generally more sensitive to memory
availability than equivalent sort-based operations. The
memory limit for a hash table is computed by multiplying
work_mem
by
hash_mem_multiplier
. This makes it
possible for hash-based operations to use an amount of memory
that exceeds the usual work_mem
base
amount.
hash_mem_multiplier
(floating point
)
Used to compute the maximum amount of memory that hash-based
operations can use. The final limit is determined by
multiplying work_mem
by
hash_mem_multiplier
. The default value is
1.0, which makes hash-based operations subject to the same
simple work_mem
maximum as sort-based
operations.
Consider increasing hash_mem_multiplier
in
environments where spilling by query operations is a regular
occurrence, especially when simply increasing
work_mem
results in memory pressure (memory
pressure typically takes the form of intermittent out of
memory errors). A setting of 1.5 or 2.0 may be effective with
mixed workloads. Higher settings in the range of 2.0 - 8.0 or
more may be effective in environments where
work_mem
has already been increased to 40MB
or more.
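As an illustration, with the settings below a sort may use up to 64MB, while a
hash table may grow to roughly 128MB (2.0 times 64MB) before spilling to disk:
work_mem = 64MB
hash_mem_multiplier = 2.0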
maintenance_work_mem
(integer
)
Specifies the maximum amount of memory to be used by maintenance
operations, such as VACUUM
, CREATE
INDEX
, and ALTER TABLE ADD FOREIGN KEY
.
If this value is specified without units, it is taken as kilobytes.
It defaults
to 64 megabytes (64MB
). Since only one of these
operations can be executed at a time by a database session, and
an installation normally doesn't have many of them running
concurrently, it's safe to set this value significantly larger
than work_mem
. Larger settings might improve
performance for vacuuming and for restoring database dumps.
Note that when autovacuum runs, up to autovacuum_max_workers times this memory may be allocated, so be careful not to set the default value too high. It may be useful to control for this by separately setting autovacuum_work_mem.
Note that for the collection of dead tuple identifiers,
VACUUM
is only able to utilize up to a maximum of
1GB
of memory.
autovacuum_work_mem
(integer
)
Specifies the maximum amount of memory to be used by each
autovacuum worker process.
If this value is specified without units, it is taken as kilobytes.
It defaults to -1, indicating that
the value of maintenance_work_mem should
be used instead. The setting has no effect on the behavior of
VACUUM
when run in other contexts.
This parameter can only be set in the
postgresql.conf
file or on the server command
line.
For the collection of dead tuple identifiers, autovacuum is only able
to utilize up to a maximum of 1GB
of memory, so
setting autovacuum_work_mem
to a value higher than
that has no effect on the number of dead tuples that autovacuum can
collect while scanning a table.
logical_decoding_work_mem
(integer
)
Specifies the maximum amount of memory to be used by logical decoding,
before some of the decoded changes are written to local disk. This
limits the amount of memory used by logical streaming replication
connections. It defaults to 64 megabytes (64MB
).
Since each replication connection only uses a single buffer of this size,
and an installation normally doesn't have many such connections
concurrently (as limited by max_wal_senders
), it's
safe to set this value significantly higher than work_mem
,
reducing the amount of decoded changes written to disk.
max_stack_depth
(integer
)
Specifies the maximum safe depth of the server's execution stack.
The ideal setting for this parameter is the actual stack size limit
enforced by the kernel (as set by ulimit -s
or local
equivalent), less a safety margin of a megabyte or so. The safety
margin is needed because the stack depth is not checked in every
routine in the server, but only in key potentially-recursive routines.
If this value is specified without units, it is taken as kilobytes.
The default setting is two megabytes (2MB
), which
is conservatively small and unlikely to risk crashes. However,
it might be too small to allow execution of complex functions.
Only superusers can change this setting.
Setting max_stack_depth
higher than
the actual kernel limit will mean that a runaway recursive function
can crash an individual backend process. On platforms where
PostgreSQL can determine the kernel limit,
the server will not allow this variable to be set to an unsafe
value. However, not all platforms provide the information,
so caution is recommended in selecting a value.
shared_memory_type
(enum
)
Specifies the shared memory implementation that the server
should use for the main shared memory region that holds
PostgreSQL's shared buffers and other
shared data. Possible values are mmap
(for
anonymous shared memory allocated using mmap
),
sysv
(for System V shared memory allocated via
shmget
) and windows
(for Windows
shared memory). Not all values are supported on all platforms; the
first supported option is the default for that platform. The use of
the sysv
option, which is not the default on any
platform, is generally discouraged because it typically requires
non-default kernel settings to allow for large allocations (see Section 19.4.1).
dynamic_shared_memory_type
(enum
)
Specifies the dynamic shared memory implementation that the server
should use. Possible values are posix
(for POSIX shared
memory allocated using shm_open
), sysv
(for System V shared memory allocated via shmget
),
windows
(for Windows shared memory),
and mmap
(to simulate shared memory using
memory-mapped files stored in the data directory).
Not all values are supported on all platforms; the first supported
option is the default for that platform. The use of the
mmap
option, which is not the default on any platform,
is generally discouraged because the operating system may write
modified pages back to disk repeatedly, increasing system I/O load;
however, it may be useful for debugging, when the
pg_dynshmem
directory is stored on a RAM disk, or when
other shared memory facilities are not available.
min_dynamic_shared_memory
(integer
)
Specifies the amount of memory that should be allocated at server
startup for use by parallel queries. When this memory region is
insufficient or exhausted by concurrent queries, new parallel queries
try to allocate extra shared memory temporarily from the operating
system using the method configured with
dynamic_shared_memory_type
, which may be slower due
to memory management overheads. Memory that is allocated at startup
with min_dynamic_shared_memory
is affected by
the huge_pages
setting on operating systems where
that is supported, and may be more likely to benefit from larger pages
on operating systems where that is managed automatically.
The default value is 0
(none). This parameter can
only be set at server start.
temp_file_limit
(integer
)
Specifies the maximum amount of disk space that a process can use
for temporary files, such as sort and hash temporary files, or the
storage file for a held cursor. A transaction attempting to exceed
this limit will be canceled.
If this value is specified without units, it is taken as kilobytes.
-1
(the default) means no limit.
Only superusers can change this setting.
This setting constrains the total space used at any instant by all temporary files used by a given PostgreSQL process. It should be noted that disk space used for explicit temporary tables, as opposed to temporary files used behind-the-scenes in query execution, does not count against this limit.
max_files_per_process
(integer
)
Sets the maximum number of simultaneously open files allowed to each server subprocess. The default is one thousand files. If the kernel is enforcing a safe per-process limit, you don't need to worry about this setting. But on some platforms (notably, most BSD systems), the kernel will allow individual processes to open many more files than the system can actually support if many processes all try to open that many files. If you find yourself seeing “Too many open files” failures, try reducing this setting. This parameter can only be set at server start.
During the execution of VACUUM
and ANALYZE
commands, the system maintains an
internal counter that keeps track of the estimated cost of the
various I/O operations that are performed. When the accumulated
cost reaches a limit (specified by
vacuum_cost_limit
), the process performing
the operation will sleep for a short period of time, as specified by
vacuum_cost_delay
. Then it will reset the
counter and continue execution.
The intent of this feature is to allow administrators to reduce
the I/O impact of these commands on concurrent database
activity. There are many situations where it is not
important that maintenance commands like
VACUUM
and ANALYZE
finish
quickly; however, it is usually very important that these
commands do not significantly interfere with the ability of the
system to perform other database operations. Cost-based vacuum
delay provides a way for administrators to achieve this.
This feature is disabled by default for manually issued
VACUUM
commands. To enable it, set the
vacuum_cost_delay
variable to a nonzero
value.
vacuum_cost_delay
(floating point
)
The amount of time that the process will sleep when the cost limit has been exceeded. If this value is specified without units, it is taken as milliseconds. The default value is zero, which disables the cost-based vacuum delay feature. Positive values enable cost-based vacuuming.
When using cost-based vacuuming, appropriate values for
vacuum_cost_delay
are usually quite small, perhaps
less than 1 millisecond. While vacuum_cost_delay
can be set to fractional-millisecond values, such delays may not be
measured accurately on older platforms. On such platforms,
increasing VACUUM
's throttled resource consumption
above what you get at 1ms will require changing the other vacuum cost
parameters. You should, nonetheless,
keep vacuum_cost_delay
as small as your platform
will consistently measure; large delays are not helpful.
vacuum_cost_page_hit
(integer
)
The estimated cost for vacuuming a buffer found in the shared buffer cache. It represents the cost to lock the buffer pool, lookup the shared hash table and scan the content of the page. The default value is one.
vacuum_cost_page_miss
(integer
)
The estimated cost for vacuuming a buffer that has to be read from disk. This represents the effort to lock the buffer pool, lookup the shared hash table, read the desired block in from the disk and scan its content. The default value is 2.
vacuum_cost_page_dirty
(integer
)
The estimated cost charged when vacuum modifies a block that was previously clean. It represents the extra I/O required to flush the dirty block out to disk again. The default value is 20.
vacuum_cost_limit
(integer
)
The accumulated cost that will cause the vacuuming process to sleep. The default value is 200.
There are certain operations that hold critical locks and should
therefore complete as quickly as possible. Cost-based vacuum
delays do not occur during such operations. Therefore it is
possible that the cost accumulates far higher than the specified
limit. To avoid uselessly long delays in such cases, the actual
delay is calculated as vacuum_cost_delay
*
accumulated_balance
/
vacuum_cost_limit
with a maximum of
vacuum_cost_delay
* 4.
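For example (the numbers are illustrative), with vacuum_cost_delay set to 2ms
and vacuum_cost_limit set to 200, an accumulated balance of 1000 would imply a
delay of 2ms * 1000 / 200 = 10ms, but the cap of vacuum_cost_delay * 4 limits
the actual sleep to 8ms.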
There is a separate server process called the background writer, whose function is to issue writes of “dirty” (new or modified) shared buffers. When the number of clean shared buffers appears to be insufficient, the background writer writes some dirty buffers to the file system and marks them as clean. This reduces the likelihood that server processes handling user queries will be unable to find clean buffers and have to write dirty buffers themselves. However, the background writer does cause a net overall increase in I/O load, because while a repeatedly-dirtied page might otherwise be written only once per checkpoint interval, the background writer might write it several times as it is dirtied in the same interval. The parameters discussed in this subsection can be used to tune the behavior for local needs.
bgwriter_delay
(integer
)
Specifies the delay between activity rounds for the
background writer. In each round the writer issues writes
for some number of dirty buffers (controllable by the
following parameters). It then sleeps for
the length of bgwriter_delay
, and repeats.
When there are no dirty buffers in the
buffer pool, though, it goes into a longer sleep regardless of
bgwriter_delay
.
If this value is specified without units, it is taken as milliseconds.
The default value is 200
milliseconds (200ms
). Note that on many systems, the
effective resolution of sleep delays is 10 milliseconds; setting
bgwriter_delay
to a value that is not a multiple of 10
might have the same results as setting it to the next higher multiple
of 10. This parameter can only be set in the
postgresql.conf
file or on the server command line.
bgwriter_lru_maxpages
(integer
)
In each round, no more than this many buffers will be written
by the background writer. Setting this to zero disables
background writing. (Note that checkpoints, which are managed by
a separate, dedicated auxiliary process, are unaffected.)
The default value is 100 buffers.
This parameter can only be set in the postgresql.conf
file or on the server command line.
bgwriter_lru_multiplier
(floating point
)
The number of dirty buffers written in each round is based on the
number of new buffers that have been needed by server processes
during recent rounds. The average recent need is multiplied by
bgwriter_lru_multiplier
to arrive at an estimate of the
number of buffers that will be needed during the next round. Dirty
buffers are written until there are that many clean, reusable buffers
available. (However, no more than bgwriter_lru_maxpages
buffers will be written per round.)
Thus, a setting of 1.0 represents a “just in time” policy
of writing exactly the number of buffers predicted to be needed.
Larger values provide some cushion against spikes in demand,
while smaller values intentionally leave writes to be done by
server processes.
The default is 2.0.
This parameter can only be set in the postgresql.conf
file or on the server command line.
bgwriter_flush_after
(integer
)
Whenever more than this amount of data has
been written by the background writer, attempt to force the OS to issue these
writes to the underlying storage. Doing so will limit the amount of
dirty data in the kernel's page cache, reducing the likelihood of
stalls when an fsync
is issued at the end of a checkpoint, or when
the OS writes data back in larger batches in the background. Often
that will result in greatly reduced transaction latency, but there
also are some cases, especially with workloads that are bigger than
shared_buffers, but smaller than the OS's page
cache, where performance might degrade. This setting may have no
effect on some platforms.
If this value is specified without units, it is taken as blocks,
that is BLCKSZ
bytes, typically 8kB.
The valid range is between
0
, which disables forced writeback, and
2MB
. The default is 512kB
on Linux,
0
elsewhere. (If BLCKSZ
is not 8kB,
the default and maximum values scale proportionally to it.)
This parameter can only be set in the postgresql.conf
file or on the server command line.
Smaller values of bgwriter_lru_maxpages
and
bgwriter_lru_multiplier
reduce the extra I/O load
caused by the background writer, but make it more likely that server
processes will have to issue writes for themselves, delaying interactive
queries.
backend_flush_after
(integer
)
Whenever more than this amount of data has
been written by a single backend, attempt to force the OS to issue
these writes to the underlying storage. Doing so will limit the
amount of dirty data in the kernel's page cache, reducing the
likelihood of stalls when an fsync
is issued at the end of a
checkpoint, or when the OS writes data back in larger batches in the
background. Often that will result in greatly reduced transaction
latency, but there also are some cases, especially with workloads
that are bigger than shared_buffers, but smaller
than the OS's page cache, where performance might degrade. This
setting may have no effect on some platforms.
If this value is specified without units, it is taken as blocks,
that is BLCKSZ
bytes, typically 8kB.
The valid range is
between 0
, which disables forced writeback,
and 2MB
. The default is 0
, i.e., no
forced writeback. (If BLCKSZ
is not 8kB,
the maximum value scales proportionally to it.)
effective_io_concurrency
(integer
)
Sets the number of concurrent disk I/O operations that PostgreSQL expects can be executed simultaneously. Raising this value will increase the number of I/O operations that any individual PostgreSQL session attempts to initiate in parallel. The allowed range is 1 to 1000, or zero to disable issuance of asynchronous I/O requests. Currently, this setting only affects bitmap heap scans.
For magnetic drives, a good starting point for this setting is the number of separate drives comprising a RAID 0 stripe or RAID 1 mirror being used for the database. (For RAID 5 the parity drive should not be counted.) However, if the database is often busy with multiple queries issued in concurrent sessions, lower values may be sufficient to keep the disk array busy. A value higher than needed to keep the disks busy will only result in extra CPU overhead. SSDs and other memory-based storage can often process many concurrent requests, so the best value might be in the hundreds.
Asynchronous I/O depends on an effective posix_fadvise
function, which some operating systems lack. If the function is not
present then setting this parameter to anything but zero will result
in an error. On some operating systems (e.g., Solaris), the function
is present but does not actually do anything.
The default is 1 on supported systems, otherwise 0. This value can be overridden for tables in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
maintenance_io_concurrency
(integer
)
Similar to effective_io_concurrency
, but used
for maintenance work that is done on behalf of many client sessions.
The default is 10 on supported systems, otherwise 0. This value can be overridden for tables in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
max_worker_processes
(integer
)
Sets the maximum number of background processes that the system can support. This parameter can only be set at server start. The default is 8.
When running a standby server, you must set this parameter to the same or higher value than on the primary server. Otherwise, queries will not be allowed in the standby server.
When changing this value, consider also adjusting max_parallel_workers, max_parallel_maintenance_workers, and max_parallel_workers_per_gather.
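A hedged sketch of how these related settings might be kept consistent in postgresql.conf (the values are illustrative, not recommendations):
max_worker_processes = 16             # total background-process pool; requires a restart
max_parallel_workers = 12             # portion of the pool available to parallel operations
max_parallel_workers_per_gather = 4
max_parallel_maintenance_workers = 4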
max_parallel_workers_per_gather (integer)
Sets the maximum number of workers that can be started by a single
Gather
or Gather Merge
node.
Parallel workers are taken from the pool of processes established by
max_worker_processes, limited by
max_parallel_workers. Note that the requested
number of workers may not actually be available at run time. If this
occurs, the plan will run with fewer workers than expected, which may
be inefficient. The default value is 2. Setting this value to 0
disables parallel query execution.
Note that parallel queries may consume very substantially more
resources than non-parallel queries, because each worker process is
a completely separate process which has roughly the same impact on the
system as an additional user session. This should be taken into
account when choosing a value for this setting, as well as when
configuring other settings that control resource utilization, such
as work_mem. Resource limits such as
work_mem
are applied individually to each worker,
which means the total utilization may be much higher across all
processes than it would normally be for any single process.
For example, a parallel query using 4 workers may use up to 5 times
as much CPU time, memory, I/O bandwidth, and so forth as a query which
uses no workers at all.
For more information on parallel query, see Chapter 15.
max_parallel_maintenance_workers (integer)
Sets the maximum number of parallel workers that can be
started by a single utility command. Currently, the parallel
utility commands that support the use of parallel workers are
CREATE INDEX
only when building a B-tree index,
and VACUUM
without FULL
option. Parallel workers are taken from the pool of processes
established by max_worker_processes, limited
by max_parallel_workers. Note that the requested
number of workers may not actually be available at run time.
If this occurs, the utility operation will run with fewer
workers than expected. The default value is 2. Setting this
value to 0 disables the use of parallel workers by utility
commands.
Note that parallel utility commands should not consume
substantially more memory than equivalent non-parallel
operations. This strategy differs from that of parallel
query, where resource limits generally apply per worker
process. Parallel utility commands treat the resource limit
maintenance_work_mem
as a limit to be applied to
the entire utility command, regardless of the number of
parallel worker processes. However, parallel utility
commands may still consume substantially more CPU resources
and I/O bandwidth.
max_parallel_workers (integer)
Sets the maximum number of workers that the system can support for parallel operations. The default value is 8. When increasing or decreasing this value, consider also adjusting max_parallel_maintenance_workers and max_parallel_workers_per_gather. Also, note that a setting for this value which is higher than max_worker_processes will have no effect, since parallel workers are taken from the pool of worker processes established by that setting.
parallel_leader_participation (boolean)
Allows the leader process to execute the query plan under
Gather
and Gather Merge
nodes
instead of waiting for worker processes. The default is
on
. Setting this value to off
reduces the likelihood that workers will become blocked because the
leader is not reading tuples fast enough, but requires the leader
process to wait for worker processes to start up before the first
tuples can be produced. The degree to which the leader can help or
hinder performance depends on the plan type, number of workers and
query duration.
old_snapshot_threshold (integer)
Sets the minimum amount of time that a query snapshot can be used without risk of a “snapshot too old” error occurring when using the snapshot. Data that has been dead for longer than this threshold is allowed to be vacuumed away. This can help prevent bloat in the face of snapshots which remain in use for a long time. To prevent incorrect results due to cleanup of data which would otherwise be visible to the snapshot, an error is generated when the snapshot is older than this threshold and the snapshot is used to read a page which has been modified since the snapshot was built.
If this value is specified without units, it is taken as minutes.
A value of -1
(the default) disables this feature,
effectively setting the snapshot age limit to infinity.
This parameter can only be set at server start.
Useful values for production work probably range from a small number
of hours to a few days. Small values (such as 0
or
1min
) are only allowed because they may sometimes be
useful for testing. While a setting as high as 60d
is
allowed, please note that in many workloads extreme bloat or
transaction ID wraparound may occur in much shorter time frames.
When this feature is enabled, freed space at the end of a relation
cannot be released to the operating system, since that could remove
information needed to detect the “snapshot too old”
condition. All space allocated to a relation remains associated with
that relation for reuse only within that relation unless explicitly
freed (for example, with VACUUM FULL
).
This setting does not attempt to guarantee that an error will be generated under any particular circumstances. In fact, if the correct results can be generated from (for example) a cursor which has materialized a result set, no error will be generated even if the underlying rows in the referenced table have been vacuumed away. Some tables cannot safely be vacuumed early, and so will not be affected by this setting, such as system catalogs. For such tables this setting will neither reduce bloat nor create a possibility of a “snapshot too old” error on scanning.
For additional information on tuning these settings, see Section 30.5.
wal_level (enum)
wal_level
determines how much information is written to
the WAL. The default value is replica
, which writes enough
data to support WAL archiving and replication, including running
read-only queries on a standby server. minimal
removes all
logging except the information required to recover from a crash or
immediate shutdown. Finally,
logical
adds information necessary to support logical
decoding. Each level includes the information logged at all lower
levels. This parameter can only be set at server start.
The minimal
level generates the least WAL
volume. It logs no row information for permanent relations
in transactions that create or
rewrite them. This can make operations much faster (see
Section 14.4.7). Operations that initiate this
optimization include:
ALTER ... SET TABLESPACE
CLUSTER
CREATE TABLE
REFRESH MATERIALIZED VIEW (without CONCURRENTLY)
REINDEX
TRUNCATE
However, minimal WAL does not contain sufficient information for
point-in-time recovery, so replica
or
higher must be used to enable continuous archiving
(archive_mode) and streaming binary replication.
In fact, the server will not even start in this mode if
max_wal_senders
is non-zero.
Note that changing wal_level
to
minimal
makes previous base backups unusable
for point-in-time recovery and standby servers.
In logical
level, the same information is logged as
with replica
, plus information needed to
extract logical change sets from the WAL. Using a level of
logical
will increase the WAL volume, particularly if many
tables are configured for REPLICA IDENTITY FULL
and
many UPDATE
and DELETE
statements are
executed.
In releases prior to 9.6, this parameter also allowed the
values archive
and hot_standby
.
These are still accepted but mapped to replica
.
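For example, to prepare a server for logical decoding, the level could be raised in postgresql.conf before a restart:
wal_level = logical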
fsync (boolean)
If this parameter is on, the PostgreSQL server
will try to make sure that updates are physically written to
disk, by issuing fsync()
system calls or various
equivalent methods (see wal_sync_method).
This ensures that the database cluster can recover to a
consistent state after an operating system or hardware crash.
While turning off fsync
is often a performance
benefit, this can result in unrecoverable data corruption in
the event of a power failure or system crash. Thus it
is only advisable to turn off fsync
if
you can easily recreate your entire database from external
data.
Examples of safe circumstances for turning off
fsync
include the initial loading of a new
database cluster from a backup file, using a database cluster
for processing a batch of data after which the database
will be thrown away and recreated,
or for a read-only database clone which
gets recreated frequently and is not used for failover. High
quality hardware alone is not a sufficient justification for
turning off fsync
.
For reliable recovery when changing fsync
off to on, it is necessary to force all modified buffers in the
kernel to durable storage. This can be done while the cluster
is shutdown or while fsync
is on by running initdb
--sync-only
, running sync
, unmounting the
file system, or rebooting the server.
In many situations, turning off synchronous_commit
for noncritical transactions can provide much of the potential
performance benefit of turning off fsync
, without
the attendant risks of data corruption.
fsync
can only be set in the postgresql.conf
file or on the server command line.
If you turn this parameter off, also consider turning off
full_page_writes.
synchronous_commit (enum)
Specifies how much WAL processing must complete before
the database server returns a “success”
indication to the client. Valid values are
remote_apply
, on
(the default), remote_write
,
local
, and off
.
If synchronous_standby_names
is empty,
the only meaningful settings are on
and
off
; remote_apply
,
remote_write
and local
all provide the same local synchronization level
as on
. The local behavior of all
non-off
modes is to wait for local flush of WAL
to disk. In off
mode, there is no waiting,
so there can be a delay between when success is reported to the
client and when the transaction is later guaranteed to be safe
against a server crash. (The maximum
delay is three times wal_writer_delay.) Unlike
fsync, setting this parameter to off
does not create any risk of database inconsistency: an operating
system or database crash might
result in some recent allegedly-committed transactions being lost, but
the database state will be just the same as if those transactions had
been aborted cleanly. So, turning synchronous_commit
off
can be a useful alternative when performance is more important than
exact certainty about the durability of a transaction. For more
discussion see Section 30.4.
If synchronous_standby_names is non-empty,
synchronous_commit
also controls whether
transaction commits will wait for their WAL records to be
processed on the standby server(s).
When set to remote_apply
, commits will wait
until replies from the current synchronous standby(s) indicate they
have received the commit record of the transaction and applied
it, so that it has become visible to queries on the standby(s),
and also written to durable storage on the standbys. This will
cause much larger commit delays than previous settings since
it waits for WAL replay. When set to on
,
commits wait until replies
from the current synchronous standby(s) indicate they have received
the commit record of the transaction and flushed it to durable storage. This
ensures the transaction will not be lost unless both the primary and
all synchronous standbys suffer corruption of their database storage.
When set to remote_write
, commits will wait until replies
from the current synchronous standby(s) indicate they have
received the commit record of the transaction and written it to
their file systems. This setting ensures data preservation if a standby instance of
PostgreSQL crashes, but not if the standby
suffers an operating-system-level crash because the data has not
necessarily reached durable storage on the standby.
The setting local
causes commits to wait for
local flush to disk, but not for replication. This is usually not
desirable when synchronous replication is in use, but is provided for
completeness.
This parameter can be changed at any time; the behavior for any
one transaction is determined by the setting in effect when it
commits. It is therefore possible, and useful, to have some
transactions commit synchronously and others asynchronously.
For example, to make a single multistatement transaction commit
asynchronously when the default is the opposite, issue SET
LOCAL synchronous_commit TO OFF
within the transaction.
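As an illustration (the table names are hypothetical), a bulk load whose most recent commits may be lost after a crash without harm could be committed asynchronously:
BEGIN;
SET LOCAL synchronous_commit TO OFF;
INSERT INTO staging_events SELECT * FROM incoming_events;
COMMIT;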
Table 20.1 summarizes the
capabilities of the synchronous_commit
settings.
Table 20.1. synchronous_commit Modes
synchronous_commit setting | local durable commit | standby durable commit after PG crash | standby durable commit after OS crash | standby query consistency |
---|---|---|---|---|
remote_apply | • | • | • | • |
on | • | • | • | |
remote_write | • | • | | |
local | • | | | |
off | | | | |
wal_sync_method (enum)
Method used for forcing WAL updates out to disk.
If fsync
is off then this setting is irrelevant,
since WAL file updates will not be forced out at all.
Possible values are:
open_datasync
(write WAL files with open()
option O_DSYNC
)
fdatasync
(call fdatasync()
at each commit)
fsync
(call fsync()
at each commit)
fsync_writethrough
(call fsync()
at each commit, forcing write-through of any disk write cache)
open_sync
(write WAL files with open()
option O_SYNC
)
The open_* options also use O_DIRECT if available.
Not all of these choices are available on all platforms.
The default is the first method in the above list that is supported
by the platform, except that fdatasync
is the default on
Linux and FreeBSD. The default is not necessarily ideal; it might be
necessary to change this setting or other aspects of your system
configuration in order to create a crash-safe configuration or
achieve optimal performance.
These aspects are discussed in Section 30.1.
This parameter can only be set in the postgresql.conf
file or on the server command line.
full_page_writes (boolean)
When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint. This is needed because a page write that is in process during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in WAL will not be enough to completely restore such a page during post-crash recovery. Storing the full page image guarantees that the page can be correctly restored, but at the price of increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)
Turning this parameter off speeds normal operation, but
might lead to either unrecoverable data corruption, or silent
data corruption, after a system failure. The risks are similar to turning off
fsync
, though smaller, and it should be turned off
only based on the same circumstances recommended for that parameter.
Turning off this parameter does not affect use of WAL archiving for point-in-time recovery (PITR) (see Section 26.3).
This parameter can only be set in the postgresql.conf
file or on the server command line.
The default is on
.
wal_log_hints (boolean)
When this parameter is on
, the PostgreSQL
server writes the entire content of each disk page to WAL during the
first modification of that page after a checkpoint, even for
non-critical modifications of so-called hint bits.
If data checksums are enabled, hint bit updates are always WAL-logged and this setting is ignored. You can use this setting to test how much extra WAL-logging would occur if your database had data checksums enabled.
This parameter can only be set at server start. The default value is off
.
wal_compression (boolean)
When this parameter is on
, the PostgreSQL
server compresses full page images written to WAL when
full_page_writes is on or during a base backup.
A compressed page image will be decompressed during WAL replay.
The default value is off
.
Only superusers can change this setting.
Turning this parameter on can reduce the WAL volume without increasing the risk of unrecoverable data corruption, but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay.
wal_init_zero (boolean)
If set to on
(the default), this option causes new
WAL files to be filled with zeroes. On some file systems, this ensures
that space is allocated before we need to write WAL records. However,
Copy-On-Write (COW) file systems may not benefit
from this technique, so the option is given to skip the unnecessary
work. If set to off
, only the final byte is written
when the file is created so that it has the expected size.
wal_recycle (boolean)
If set to on
(the default), this option causes WAL
files to be recycled by renaming them, avoiding the need to create new
ones. On COW file systems, it may be faster to create new ones, so the
option is given to disable this behavior.
wal_buffers (integer)
The amount of shared memory used for WAL data that has not yet been
written to disk. The default setting of -1 selects a size equal to
1/32nd (about 3%) of shared_buffers, but not less
than 64kB
nor more than the size of one WAL
segment, typically 16MB
. This value can be set
manually if the automatic choice is too large or too small,
but any positive value less than 32kB
will be
treated as 32kB
.
If this value is specified without units, it is taken as WAL blocks,
that is XLOG_BLCKSZ
bytes, typically 8kB.
This parameter can only be set at server start.
The contents of the WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit. However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once. The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.
wal_writer_delay (integer)
Specifies how often the WAL writer flushes WAL, in time terms.
After flushing WAL the writer sleeps for the length of time given
by wal_writer_delay
, unless woken up sooner
by an asynchronously committing transaction. If the last flush
happened less than wal_writer_delay
ago and less
than wal_writer_flush_after
worth of WAL has been
produced since, then WAL is only written to the operating system, not
flushed to disk.
If this value is specified without units, it is taken as milliseconds.
The default value is 200 milliseconds (200ms
). Note that
on many systems, the effective resolution of sleep delays is 10
milliseconds; setting wal_writer_delay
to a value that is
not a multiple of 10 might have the same results as setting it to the
next higher multiple of 10. This parameter can only be set in the
postgresql.conf
file or on the server command line.
wal_writer_flush_after (integer)
Specifies how often the WAL writer flushes WAL, in volume terms.
If the last flush happened less
than wal_writer_delay
ago and less
than wal_writer_flush_after
worth of WAL has been
produced since, then WAL is only written to the operating system, not
flushed to disk. If wal_writer_flush_after
is set
to 0
then WAL data is always flushed immediately.
If this value is specified without units, it is taken as WAL blocks,
that is XLOG_BLCKSZ
bytes, typically 8kB.
The default is 1MB
.
This parameter can only be set in the
postgresql.conf
file or on the server command line.
wal_skip_threshold (integer)
When wal_level
is minimal
and a
transaction commits after creating or rewriting a permanent relation,
this setting determines how to persist the new data. If the data is
smaller than this setting, write it to the WAL log; otherwise, use an
fsync of affected files. Depending on the properties of your storage,
raising or lowering this value might help if such commits are slowing
concurrent transactions. If this value is specified without units, it
is taken as kilobytes. The default is two megabytes
(2MB
).
commit_delay (integer)
Setting commit_delay
adds a time delay
before a WAL flush is initiated. This can improve
group commit throughput by allowing a larger number of transactions
to commit via a single WAL flush, if system load is high enough
that additional transactions become ready to commit within the
given interval. However, it also increases latency by up to the
commit_delay
for each WAL
flush. Because the delay is just wasted if no other transactions
become ready to commit, a delay is only performed if at least
commit_siblings
other transactions are active
when a flush is about to be initiated. Also, no delays are
performed if fsync
is disabled.
If this value is specified without units, it is taken as microseconds.
The default commit_delay
is zero (no delay).
Only superusers can change this setting.
In PostgreSQL releases prior to 9.3,
commit_delay
behaved differently and was much
less effective: it affected only commits, rather than all WAL flushes,
and waited for the entire configured delay even if the WAL flush
was completed sooner. Beginning in PostgreSQL 9.3,
the first process that becomes ready to flush waits for the configured
interval, while subsequent processes wait only until the leader
completes the flush operation.
commit_siblings (integer)
Minimum number of concurrent open transactions to require
before performing the commit_delay
delay. A larger
value makes it more probable that at least one other
transaction will become ready to commit during the delay
interval. The default is five transactions.
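A hedged sketch for experimenting with group commit on a busy server (the values are illustrative; the default of zero delay is often fine):
commit_delay = 1000       # wait up to 1000 microseconds before flushing, hoping to batch commits
commit_siblings = 5       # but only delay when at least five other transactions are open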
checkpoint_timeout (integer)
Maximum time between automatic WAL checkpoints.
If this value is specified without units, it is taken as seconds.
The valid range is between 30 seconds and one day.
The default is five minutes (5min
).
Increasing this parameter can increase the amount of time needed
for crash recovery.
This parameter can only be set in the postgresql.conf
file or on the server command line.
checkpoint_completion_target (floating point)
Specifies the target of checkpoint completion, as a fraction of
total time between checkpoints. The default is 0.9, which spreads the
checkpoint across almost all of the available interval, providing fairly
consistent I/O load while also leaving some time for checkpoint
completion overhead. Reducing this parameter is not recommended because
it causes the checkpoint to complete faster. This results in a higher
rate of I/O during the checkpoint followed by a period of less I/O between
the checkpoint completion and the next scheduled checkpoint. This
parameter can only be set in the postgresql.conf
file
or on the server command line.
checkpoint_flush_after (integer)
Whenever more than this amount of data has been
written while performing a checkpoint, attempt to force the
OS to issue these writes to the underlying storage. Doing so will
limit the amount of dirty data in the kernel's page cache, reducing
the likelihood of stalls when an fsync
is issued at the end of the
checkpoint, or when the OS writes data back in larger batches in the
background. Often that will result in greatly reduced transaction
latency, but there also are some cases, especially with workloads
that are bigger than shared_buffers, but smaller
than the OS's page cache, where performance might degrade. This
setting may have no effect on some platforms.
If this value is specified without units, it is taken as blocks,
that is BLCKSZ
bytes, typically 8kB.
The valid range is
between 0
, which disables forced writeback,
and 2MB
. The default is 256kB
on
Linux, 0
elsewhere. (If BLCKSZ
is not
8kB, the default and maximum values scale proportionally to it.)
This parameter can only be set in the postgresql.conf
file or on the server command line.
checkpoint_warning (integer)
Write a message to the server log if checkpoints caused by
the filling of WAL segment files happen closer together
than this amount of time (which suggests that
max_wal_size
ought to be raised).
If this value is specified without units, it is taken as seconds.
The default is 30 seconds (30s
).
Zero disables the warning.
No warnings will be generated if checkpoint_timeout
is less than checkpoint_warning
.
This parameter can only be set in the postgresql.conf
file or on the server command line.
max_wal_size (integer)
Maximum size to let the WAL grow during automatic
checkpoints. This is a soft limit; WAL size can exceed
max_wal_size
under special circumstances, such as
heavy load, a failing archive_command
, or a high
wal_keep_size
setting.
If this value is specified without units, it is taken as megabytes.
The default is 1 GB.
Increasing this parameter can increase the amount of time needed for
crash recovery.
This parameter can only be set in the postgresql.conf
file or on the server command line.
min_wal_size (integer)
As long as WAL disk usage stays below this setting, old WAL files are
always recycled for future use at a checkpoint, rather than removed.
This can be used to ensure that enough WAL space is reserved to
handle spikes in WAL usage, for example when running large batch
jobs.
If this value is specified without units, it is taken as megabytes.
The default is 80 MB.
This parameter can only be set in the postgresql.conf
file or on the server command line.
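A hedged example of checkpoint-related settings in postgresql.conf (values are illustrative and depend on available disk space and recovery-time requirements):
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
max_wal_size = 4GB
min_wal_size = 512MB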
archive_mode (enum)
When archive_mode
is enabled, completed WAL segments
are sent to archive storage by setting
archive_command. In addition to off
,
to disable, there are two modes: on
, and
always
. During normal operation, there is no
difference between the two modes, but when set to always
the WAL archiver is enabled also during archive recovery or standby
mode. In always
mode, all files restored from the archive
or streamed with streaming replication will be archived (again). See
Section 27.2.9 for details.
archive_mode
and archive_command
are
separate variables so that archive_command
can be
changed without leaving archiving mode.
This parameter can only be set at server start.
archive_mode
cannot be enabled when
wal_level
is set to minimal
.
archive_command (string)
The local shell command to execute to archive a completed WAL file
segment. Any %p
in the string is
replaced by the path name of the file to archive, and any
%f
is replaced by only the file name.
(The path name is relative to the working directory of the server,
i.e., the cluster's data directory.)
Use %%
to embed an actual %
character in the
command. It is important for the command to return a zero
exit status only if it succeeds. For more information see
Section 26.3.1.
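A minimal sketch, assuming /mnt/server/archivedir is an archive location mounted on the server; the command refuses to overwrite an existing file and returns a nonzero exit status on failure:
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'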
This parameter can only be set in the postgresql.conf
file or on the server command line. It is ignored unless
archive_mode
was enabled at server start.
If archive_command
is an empty string (the default) while
archive_mode
is enabled, WAL archiving is temporarily
disabled, but the server continues to accumulate WAL segment files in
the expectation that a command will soon be provided. Setting
archive_command
to a command that does nothing but
return true, e.g., /bin/true
(REM
on
Windows), effectively disables
archiving, but also breaks the chain of WAL files needed for
archive recovery, so it should only be used in unusual circumstances.
archive_timeout (integer)
The archive_command is only invoked for
completed WAL segments. Hence, if your server generates little WAL
traffic (or has slack periods where it does so), there could be a
long delay between the completion of a transaction and its safe
recording in archive storage. To limit how old unarchived
data can be, you can set archive_timeout
to force the
server to switch to a new WAL segment file periodically. When this
parameter is greater than zero, the server will switch to a new
segment file whenever this amount of time has elapsed since the last
segment file switch, and there has been any database activity,
including a single checkpoint (checkpoints are skipped if there is
no database activity). Note that archived files that are closed
early due to a forced switch are still the same length as completely
full files. Therefore, it is unwise to use a very short
archive_timeout
— it will bloat your archive
storage. archive_timeout
settings of a minute or so are
usually reasonable. You should consider using streaming replication,
instead of archiving, if you want data to be copied off the primary
server more quickly than that.
If this value is specified without units, it is taken as seconds.
This parameter can only be set in the
postgresql.conf
file or on the server command line.
This section describes the settings that apply only for the duration of the recovery. They must be reset for any subsequent recovery you wish to perform.
“Recovery” covers using the server as a standby or for executing a targeted recovery. Typically, standby mode would be used to provide high availability and/or read scalability, whereas a targeted recovery is used to recover from data loss.
To start the server in standby mode, create a file called
standby.signal
in the data directory. The server will enter recovery and will not stop
recovery when the end of archived WAL is reached, but will keep trying to
continue recovery by connecting to the sending server as specified by the
primary_conninfo
setting and/or by fetching new WAL
segments using restore_command
. For this mode, the
parameters from this section and Section 20.6.3 are of interest.
Parameters from Section 20.5.5 will
also be applied but are typically not useful in this mode.
To start the server in targeted recovery mode, create a file called
recovery.signal
in the data directory. If both standby.signal
and
recovery.signal
files are created, standby mode
takes precedence. Targeted recovery mode ends when the archived WAL is
fully replayed, or when recovery_target
is reached.
In this mode, the parameters from both this section and Section 20.5.5 will be used.
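For example, assuming the data directory is /var/lib/postgresql/data, one of the signal files can be created before starting the server:
touch /var/lib/postgresql/data/standby.signal     # enter standby mode, or
touch /var/lib/postgresql/data/recovery.signal    # perform a targeted recovery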
restore_command (string)
The local shell command to execute to retrieve an archived segment of
the WAL file series. This parameter is required for archive recovery,
but optional for streaming replication.
Any %f
in the string is
replaced by the name of the file to retrieve from the archive,
and any %p
is replaced by the copy destination path name
on the server.
(The path name is relative to the current working directory,
i.e., the cluster's data directory.)
Any %r
is replaced by the name of the file containing the
last valid restart point. That is the earliest file that must be kept
to allow a restore to be restartable, so this information can be used
to truncate the archive to just the minimum required to support
restarting from the current restore. %r
is typically only
used by warm-standby configurations
(see Section 27.2).
Write %%
to embed an actual %
character.
It is important for the command to return a zero exit status only if it succeeds. The command will be asked for file names that are not present in the archive; it must return nonzero when so asked. Examples:
restore_command = 'cp /mnt/server/archivedir/%f "%p"'
restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
An exception is that if the command was terminated by a signal (other than SIGTERM, which is used as part of a database server shutdown) or an error by the shell (such as command not found), then recovery will abort and the server will not start up.
This parameter can only be set in the postgresql.conf
file or on the server command line.
archive_cleanup_command (string)
This optional parameter specifies a shell command that will be executed
at every restartpoint. The purpose of
archive_cleanup_command
is to provide a mechanism for
cleaning up old archived WAL files that are no longer needed by the
standby server.
Any %r
is replaced by the name of the file containing the
last valid restart point.
That is the earliest file that must be kept to allow a
restore to be restartable, and so all files earlier than %r
may be safely removed.
This information can be used to truncate the archive to just the
minimum required to support restart from the current restore.
The pg_archivecleanup module
is often used in archive_cleanup_command
for
single-standby configurations, for example:
archive_cleanup_command = 'pg_archivecleanup /mnt/server/archivedir %r'
Note however that if multiple standby servers are restoring from the
same archive directory, you will need to ensure that you do not delete
WAL files until they are no longer needed by any of the servers.
archive_cleanup_command
would typically be used in a
warm-standby configuration (see Section 27.2).
Write %%
to embed an actual %
character in the
command.
If the command returns a nonzero exit status then a warning log message will be written. An exception is that if the command was terminated by a signal or an error by the shell (such as command not found), a fatal error will be raised.
This parameter can only be set in the postgresql.conf
file or on the server command line.
recovery_end_command (string)
This parameter specifies a shell command that will be executed once only
at the end of recovery. This parameter is optional. The purpose of the
recovery_end_command
is to provide a mechanism for cleanup
following replication or recovery.
Any %r
is replaced by the name of the file containing the
last valid restart point, like in archive_cleanup_command.
If the command returns a nonzero exit status then a warning log message will be written and the database will proceed to start up anyway. An exception is that if the command was terminated by a signal or an error by the shell (such as command not found), the database will not proceed with startup.
This parameter can only be set in the postgresql.conf
file or on the server command line.
By default, recovery will recover to the end of the WAL log. The
following parameters can be used to specify an earlier stopping point.
At most one of recovery_target
,
recovery_target_lsn
, recovery_target_name
,
recovery_target_time
, or recovery_target_xid
can be used; if more than one of these is specified in the configuration
file, an error will be raised.
These parameters can only be set at server start.
recovery_target = 'immediate'
This parameter specifies that recovery should end as soon as a consistent state is reached, i.e., as early as possible. When restoring from an online backup, this means the point where taking the backup ended.
Technically, this is a string parameter, but 'immediate'
is currently the only allowed value.
recovery_target_name (string)
This parameter specifies the named restore point (created with
pg_create_restore_point()
) to which recovery will proceed.
recovery_target_time (timestamp)
This parameter specifies the time stamp up to which recovery will proceed. The precise stopping point is also influenced by recovery_target_inclusive.
The value of this parameter is a time stamp in the same format
accepted by the timestamp with time zone
data type,
except that you cannot use a time zone abbreviation (unless the
timezone_abbreviations variable has been set
earlier in the configuration file). Preferred style is to use a
numeric offset from UTC, or you can write a full time zone name,
e.g., Europe/Helsinki
not EEST
.
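As an illustration (the timestamp is hypothetical), a point-in-time target expressed with a numeric UTC offset might look like this in postgresql.conf; recovery_target_action, described below, controls what happens once the target is reached:
recovery_target_time = '2024-05-01 12:00:00+00'
recovery_target_action = 'pause'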
recovery_target_xid (string)
This parameter specifies the transaction ID up to which recovery will proceed. Keep in mind that while transaction IDs are assigned sequentially at transaction start, transactions can complete in a different numeric order. The transactions that will be recovered are those that committed before (and optionally including) the specified one. The precise stopping point is also influenced by recovery_target_inclusive.
recovery_target_lsn (pg_lsn)
This parameter specifies the LSN of the write-ahead log location up
to which recovery will proceed. The precise stopping point is also
influenced by recovery_target_inclusive. This
parameter is parsed using the system data type
pg_lsn
.
The following options further specify the recovery target, and affect what happens when the target is reached:
recovery_target_inclusive (boolean)
Specifies whether to stop just after the specified recovery target
(on
), or just before the recovery target
(off
).
Applies when recovery_target_lsn,
recovery_target_time, or
recovery_target_xid is specified.
This setting controls whether transactions
having exactly the target WAL location (LSN), commit time, or transaction ID, respectively, will
be included in the recovery. Default is on
.
recovery_target_timeline (string)
Specifies recovering into a particular timeline. The value can be a
numeric timeline ID or a special value. The value
current
recovers along the same timeline that was
current when the base backup was taken. The
value latest
recovers
to the latest timeline found in the archive, which is useful in
a standby server. latest
is the default.
You usually only need to set this parameter in complex re-recovery situations, where you need to return to a state that itself was reached after a point-in-time recovery. See Section 26.3.5 for discussion.
recovery_target_action (enum)
Specifies what action the server should take once the recovery target is
reached. The default is pause
, which means recovery will
be paused. promote
means the recovery process will finish
and the server will start to accept connections.
Finally shutdown
will stop the server after reaching the
recovery target.
The intended use of the pause
setting is to allow queries
to be executed against the database to check if this recovery target
is the most desirable point for recovery.
The paused state can be resumed by
using pg_wal_replay_resume()
(see
Table 9.89), which then
causes recovery to end. If this recovery target is not the
desired stopping point, then shut down the server, change the
recovery target settings to a later target and restart to
continue recovery.
The shutdown
setting is useful to have the instance ready
at the exact replay point desired. The instance will still be able to
replay more WAL records (and in fact will have to replay WAL records
since the last checkpoint next time it is started).
Note that because recovery.signal
will not be
removed when recovery_target_action
is set to shutdown
,
any subsequent start will end with immediate shutdown unless the
configuration is changed or the recovery.signal
file is removed manually.
This setting has no effect if no recovery target is set.
If hot_standby is not enabled, a setting of
pause
will act the same as shutdown
.
If the recovery target is reached while a promotion is ongoing,
a setting of pause
will act the same as
promote
.
In any case, if a recovery target is configured but the archive recovery ends before the target is reached, the server will shut down with a fatal error.
These settings control the behavior of the built-in streaming replication feature (see Section 27.2.5). Servers will be either a primary or a standby server. Primaries can send data, while standbys are always receivers of replicated data. When cascading replication (see Section 27.2.7) is used, standby servers can also be senders, as well as receivers. Parameters are mainly for sending and standby servers, though some parameters have meaning only on the primary server. Settings may vary across the cluster without problems if that is required.
These parameters can be set on any server that is to send replication data to one or more standby servers. The primary is always a sending server, so these parameters must always be set on the primary. The role and meaning of these parameters does not change after a standby becomes the primary.
max_wal_senders (integer)
Specifies the maximum number of concurrent connections from standby
servers or streaming base backup clients (i.e., the maximum number of
simultaneously running WAL sender processes). The default is
10
. The value 0
means
replication is disabled. Abrupt disconnection of a streaming client might
leave an orphaned connection slot behind until a timeout is reached,
so this parameter should be set slightly higher than the maximum
number of expected clients so disconnected clients can immediately
reconnect. This parameter can only be set at server start. Also,
wal_level
must be set to
replica
or higher to allow connections from standby
servers.
When running a standby server, you must set this parameter to a value equal to or higher than the value on the primary server. Otherwise, queries will not be allowed on the standby server.
max_replication_slots (integer)
Specifies the maximum number of replication slots
(see Section 27.2.6) that the server
can support. The default is 10. This parameter can only be set at
server start.
Setting it to a lower value than the number of currently
existing replication slots will prevent the server from starting.
Also, wal_level
must be set
to replica
or higher to allow replication slots to
be used.
On the subscriber side, specifies how many replication origins (see Chapter 50) can be tracked simultaneously, effectively limiting how many logical replication subscriptions can be created on the server. Setting it to a lower value than the current number of tracked replication origins (reflected in pg_replication_origin_status, not pg_replication_origin) will prevent the server from starting.
wal_keep_size (integer)
Specifies the minimum size of past log file segments kept in the
pg_wal
directory, in case a standby server needs to fetch them for streaming
replication. If a standby
server connected to the sending server falls behind by more than
wal_keep_size
megabytes, the sending server might
remove a WAL segment still needed by the standby, in which case the
replication connection will be terminated. Downstream connections
will also eventually fail as a result. (However, the standby
server can recover by fetching the segment from archive, if WAL
archiving is in use.)
This sets only the minimum size of segments retained in
pg_wal
; the system might need to retain more segments
for WAL archival or to recover from a checkpoint. If
wal_keep_size
is zero (the default), the system
doesn't keep any extra segments for standby purposes, so the number
of old WAL segments available to standby servers is a function of
the location of the previous checkpoint and status of WAL
archiving.
If this value is specified without units, it is taken as megabytes.
This parameter can only be set in the
postgresql.conf
file or on the server command line.
max_slot_wal_keep_size (integer)
Specify the maximum size of WAL files
that replication
slots are allowed to retain in the pg_wal
directory at checkpoint time.
If max_slot_wal_keep_size
is -1 (the default),
replication slots may retain an unlimited amount of WAL files. Otherwise, if
restart_lsn of a replication slot falls behind the current LSN by more
than the given size, the standby using the slot may no longer be able
to continue replication due to removal of required WAL files. You
can see the WAL availability of replication slots
in pg_replication_slots.
If this value is specified without units, it is taken as megabytes.
This parameter can only be set in the postgresql.conf
file or on the server command line.
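A hedged sketch combining the two retention settings (values illustrative):
wal_keep_size = 1GB              # always keep at least this much WAL in pg_wal for standbys
max_slot_wal_keep_size = 10GB    # cap the WAL that replication slots may retain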
wal_sender_timeout (integer)
Terminate replication connections that are inactive for longer than this amount of time. This is useful for the sending server to detect a standby crash or network outage. If this value is specified without units, it is taken as milliseconds. The default value is 60 seconds. A value of zero disables the timeout mechanism.
With a cluster distributed across multiple geographic locations, using different values per location brings more flexibility in the cluster management. A smaller value is useful for faster failure detection with a standby having a low-latency network connection, and a larger value helps in judging better the health of a standby if located on a remote location, with a high-latency network connection.
track_commit_timestamp (boolean)
Record commit time of transactions. This parameter
can only be set in postgresql.conf
file or on the server
command line. The default value is off
.
These parameters can be set on the primary server that is to send replication data to one or more standby servers. Note that in addition to these parameters, wal_level must be set appropriately on the primary server, and optionally WAL archiving can be enabled as well (see Section 20.5.3). The values of these parameters on standby servers are irrelevant, although you may wish to set them there in preparation for the possibility of a standby becoming the primary.
synchronous_standby_names (string)
Specifies a list of standby servers that can support
synchronous replication, as described in
Section 27.2.8.
There will be one or more active synchronous standbys;
transactions waiting for commit will be allowed to proceed after
these standby servers confirm receipt of their data.
The synchronous standbys will be those whose names appear
in this list, and
that are both currently connected and streaming data in real-time
(as shown by a state of streaming
in the
pg_stat_replication
view).
Specifying more than one synchronous standby can allow for very high
availability and protection against data loss.
The name of a standby server for this purpose is the
application_name
setting of the standby, as set in the
standby's connection information. In case of a physical replication
standby, this should be set in the primary_conninfo
setting; the default is the setting of cluster_name
if set, else walreceiver
.
For logical replication, this can be set in the connection
information of the subscription, and it defaults to the
subscription name. For other replication stream consumers,
consult their documentation.
This parameter specifies a list of standby servers using either of the following syntaxes:
[FIRST] num_sync ( standby_name [, ...] )
ANY num_sync ( standby_name [, ...] )
standby_name [, ...]
where num_sync
is
the number of synchronous standbys that transactions need to
wait for replies from,
and standby_name
is the name of a standby server.
FIRST
and ANY
specify the method to choose
synchronous standbys from the listed servers.
The keyword FIRST
, coupled with
num_sync
, specifies a
priority-based synchronous replication and makes transaction commits
wait until their WAL records are replicated to
num_sync
synchronous
standbys chosen based on their priorities. For example, a setting of
FIRST 3 (s1, s2, s3, s4)
will cause each commit to wait for
replies from three higher-priority standbys chosen from standby servers
s1
, s2
, s3
and s4
.
The standbys whose names appear earlier in the list are given higher
priority and will be considered as synchronous. Other standby servers
appearing later in this list represent potential synchronous standbys.
If any of the current synchronous standbys disconnects for whatever
reason, it will be replaced immediately with the next-highest-priority
standby. The keyword FIRST
is optional.
The keyword ANY
, coupled with
num_sync
, specifies a
quorum-based synchronous replication and makes transaction commits
wait until their WAL records are replicated to at least
num_sync
listed standbys.
For example, a setting of ANY 3 (s1, s2, s3, s4)
will cause
each commit to proceed as soon as at least any three standbys of
s1
, s2
, s3
and s4
reply.
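Written as postgresql.conf entries, with hypothetical standby names, the two methods look like this (only one of the lines would actually be set):
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'    # priority-based: the two highest-priority connected standbys
synchronous_standby_names = 'ANY 2 (s1, s2, s3, s4)'  # quorum-based: any two of the listed standbys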
FIRST
and ANY
are case-insensitive. If these
keywords are used as the name of a standby server,
its standby_name
must
be double-quoted.
The third syntax was used before PostgreSQL
version 9.6 and is still supported. It's the same as the first syntax
with FIRST
and
num_sync
equal to 1.
For example, FIRST 1 (s1, s2)
and s1, s2
have
the same meaning: either s1
or s2
is chosen
as a synchronous standby.
The special entry *
matches any standby name.
There is no mechanism to enforce uniqueness of standby names. In case of duplicates one of the matching standbys will be considered as higher priority, though exactly which one is indeterminate.
Each standby_name
should have the form of a valid SQL identifier, unless it
is *
. You can use double-quoting if necessary. But note
that standby_names are compared to standby application names case-insensitively, whether
double-quoted or not.
If no synchronous standby names are specified here, then synchronous
replication is not enabled and transaction commits will not wait for
replication. This is the default configuration. Even when
synchronous replication is enabled, individual transactions can be
configured not to wait for replication by setting the
synchronous_commit parameter to
local
or off
.
This parameter can only be set in the postgresql.conf
file or on the server command line.
vacuum_defer_cleanup_age (integer)
Specifies the number of transactions by which VACUUM
and
HOT updates
will defer cleanup of dead row versions. The
default is zero transactions, meaning that dead row versions can be
removed as soon as possible, that is, as soon as they are no longer
visible to any open transaction. You may wish to set this to a
non-zero value on a primary server that is supporting hot standby
servers, as described in Section 27.4. This allows
more time for queries on the standby to complete without incurring
conflicts due to early cleanup of rows. However, since the value
is measured in terms of number of write transactions occurring on the
primary server, it is difficult to predict just how much additional
grace time will be made available to standby queries.
This parameter can only be set in the postgresql.conf
file or on the server command line.
You should also consider setting hot_standby_feedback
on standby server(s) as an alternative to using this parameter.
This does not prevent cleanup of dead rows which have reached the age
specified by old_snapshot_threshold
.
These settings control the behavior of a standby server that is to receive replication data. Their values on the primary server are irrelevant.
primary_conninfo (string)
Specifies a connection string to be used for the standby server to connect with a sending server. This string is in the format described in Section 34.1.1. If any option is unspecified in this string, then the corresponding environment variable (see Section 34.15) is checked. If the environment variable is not set either, then defaults are used.
The connection string should specify the host name (or address)
of the sending server, as well as the port number if it is not
the same as the standby server's default.
Also specify a user name corresponding to a suitably-privileged role
on the sending server (see
Section 27.2.5.1).
A password needs to be provided too, if the sender demands password
authentication. It can be provided in the
primary_conninfo
string, or in a separate
~/.pgpass
file on the standby server (use
replication
as the database name).
Do not specify a database name in the
primary_conninfo
string.
This parameter can only be set in the postgresql.conf
file or on the server command line.
If this parameter is changed while the WAL receiver process is
running, that process is signaled to shut down and expected to
restart with the new setting (except if primary_conninfo
is an empty string).
This setting has no effect if the server is not in standby mode.
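A hedged example with a hypothetical host and role; the password could instead come from a ~/.pgpass entry with replication as the database name, and application_name ties the standby to an entry in synchronous_standby_names:
primary_conninfo = 'host=primary.example.com port=5432 user=replicator application_name=standby1'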
primary_slot_name (string)
Optionally specifies an existing replication slot to be used when
connecting to the sending server via streaming replication to control
resource removal on the upstream node
(see Section 27.2.6).
This parameter can only be set in the postgresql.conf
file or on the server command line.
If this parameter is changed while the WAL receiver process is running,
that process is signaled to shut down and expected to restart with the
new setting.
This setting has no effect if primary_conninfo
is not
set or the server is not in standby mode.
promote_trigger_file (string)
Specifies a trigger file whose presence ends recovery in the
standby. Even if this value is not set, you can still promote
the standby using pg_ctl promote
or calling
pg_promote()
.
This parameter can only be set in the postgresql.conf
file or on the server command line.
hot_standby (boolean)
Specifies whether or not you can connect and run queries during
recovery, as described in Section 27.4.
The default value is on
.
This parameter can only be set at server start. It only has effect
during archive recovery or in standby mode.
max_standby_archive_delay (integer)
When Hot Standby is active, this parameter determines how long the
standby server should wait before canceling standby queries that
conflict with about-to-be-applied WAL entries, as described in
Section 27.4.2.
max_standby_archive_delay
applies when WAL data is
being read from WAL archive (and is therefore not current).
If this value is specified without units, it is taken as milliseconds.
The default is 30 seconds.
A value of -1 allows the standby to wait forever for conflicting
queries to complete.
This parameter can only be set in the postgresql.conf
file or on the server command line.
Note that max_standby_archive_delay
is not the same as the
maximum length of time a query can run before cancellation; rather it
is the maximum total time allowed to apply any one WAL segment's data.
Thus, if one query has resulted in significant delay earlier in the
WAL segment, subsequent conflicting queries will have much less grace
time.
max_standby_streaming_delay (integer)
When Hot Standby is active, this parameter determines how long the
standby server should wait before canceling standby queries that
conflict with about-to-be-applied WAL entries, as described in
Section 27.4.2.
max_standby_streaming_delay
applies when WAL data is
being received via streaming replication.
If this value is specified without units, it is taken as milliseconds.
The default is 30 seconds.
A value of -1 allows the standby to wait forever for conflicting
queries to complete.
This parameter can only be set in the postgresql.conf
file or on the server command line.
Note that max_standby_streaming_delay
is not the same as
the maximum length of time a query can run before cancellation; rather
it is the maximum total time allowed to apply WAL data once it has
been received from the primary server. Thus, if one query has
resulted in significant delay, subsequent conflicting queries will
have much less grace time until the standby server has caught up
again.
wal_receiver_create_temp_slot (boolean)
Specifies whether the WAL receiver process should create a temporary replication
slot on the remote instance when no permanent replication slot to use
has been configured (using primary_slot_name).
The default is off. This parameter can only be set in the
postgresql.conf
file or on the server command line.
If this parameter is changed while the WAL receiver process is running,
that process is signaled to shut down and expected to restart with
the new setting.
wal_receiver_status_interval (integer)
Specifies the minimum frequency for the WAL receiver
process on the standby to send information about replication progress
to the primary or upstream standby, where it can be seen using the
pg_stat_replication
view. The standby will report
the last write-ahead log location it has written, the last position it
has flushed to disk, and the last position it has applied.
This parameter's value is the maximum amount of time between reports.
Updates are sent each time the write or flush positions change, or as
often as specified by this parameter if set to a non-zero value.
There are additional cases where updates are sent while ignoring this
parameter; for example, when processing of the existing WAL completes
or when synchronous_commit
is set to
remote_apply
.
Thus, the apply position may lag slightly behind the true position.
If this value is specified without units, it is taken as seconds.
The default value is 10 seconds. This parameter can only be set in
the postgresql.conf
file or on the server
command line.
hot_standby_feedback (boolean)
Specifies whether or not a hot standby will send feedback to the primary
or upstream standby
about queries currently executing on the standby. This parameter can
be used to eliminate query cancels caused by cleanup records, but
can cause database bloat on the primary for some workloads.
Feedback messages will not be sent more frequently than once per
wal_receiver_status_interval
. The default value is
off
. This parameter can only be set in the
postgresql.conf
file or on the server command line.
If cascaded replication is in use the feedback is passed upstream until it eventually reaches the primary. Standbys make no other use of feedback they receive other than to pass upstream.
This setting does not override the behavior of
old_snapshot_threshold
on the primary; a snapshot on the
standby which exceeds the primary's age threshold can become invalid,
resulting in cancellation of transactions on the standby. This is
because old_snapshot_threshold
is intended to provide an
absolute limit on the time which dead rows can contribute to bloat,
which would otherwise be violated because of the configuration of a
standby.
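A minimal sketch, assuming a superuser session on the standby: the setting can also be changed with ALTER SYSTEM (which writes postgresql.auto.conf rather than editing postgresql.conf by hand) and then made effective with a configuration reload:
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();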
wal_receiver_timeout (integer)
Terminate replication connections that are inactive for longer
than this amount of time. This is useful for
the receiving standby server to detect a primary node crash or network
outage.
If this value is specified without units, it is taken as milliseconds.
The default value is 60 seconds.
A value of zero disables the timeout mechanism.
This parameter can only be set in
the postgresql.conf
file or on the server
command line.
wal_retrieve_retry_interval (integer)
Specifies how long the standby server should wait when WAL data is not
available from any sources (streaming replication,
local pg_wal
or WAL archive) before trying
again to retrieve WAL data.
If this value is specified without units, it is taken as milliseconds.
The default value is 5 seconds.
This parameter can only be set in
the postgresql.conf
file or on the server
command line.
This parameter is useful in configurations where a node in recovery needs to control the amount of time to wait for new WAL data to be available. For example, in archive recovery, it is possible to make the recovery more responsive in the detection of a new WAL file by reducing the value of this parameter. On a system with low WAL activity, increasing it reduces the number of requests necessary to access WAL archives, which is useful, for example, in cloud environments where the number of times the infrastructure is accessed is taken into account.
recovery_min_apply_delay (integer)
By default, a standby server restores WAL records from the
sending server as soon as possible. It may be useful to have a time-delayed
copy of the data, offering opportunities to correct data loss errors.
This parameter allows you to delay recovery by a specified amount
of time. For example, if
you set this parameter to 5min, the standby will
replay each transaction commit only when the system time on the standby
is at least five minutes past the commit time reported by the primary.
If this value is specified without units, it is taken as milliseconds.
The default is zero, adding no delay.
It is possible that the replication delay between servers exceeds the value of this parameter, in which case no delay is added. Note that the delay is calculated between the WAL time stamp as written on primary and the current time on the standby. Delays in transfer because of network lag or cascading replication configurations may reduce the actual wait time significantly. If the system clocks on primary and standby are not synchronized, this may lead to recovery applying records earlier than expected; but that is not a major issue because useful settings of this parameter are much larger than typical time deviations between servers.
The delay occurs only on WAL records for transaction commits. Other records are replayed as quickly as possible, which is not a problem because MVCC visibility rules ensure their effects are not visible until the corresponding commit record is applied.
The delay occurs once the database in recovery has reached a consistent state, until the standby is promoted or triggered. After that the standby will end recovery without further waiting.
This parameter is intended for use with streaming replication deployments;
however, if the parameter is specified it will be honored in all cases
except crash recovery.
hot_standby_feedback
will be delayed by use of this feature
which could lead to bloat on the primary; use both together with care.
Synchronous replication is affected by this setting when synchronous_commit is set to remote_apply; every COMMIT will need to wait to be applied.
This parameter can only be set in the postgresql.conf
file or on the server command line.
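For example, a postgresql.conf sketch for a deliberately delayed standby (the one-hour value is purely illustrative; pick a delay much larger than any clock skew between the servers):
recovery_min_apply_delay = 1h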
These settings control the behavior of a logical replication subscriber. Their values on the publisher are irrelevant.
Note that the wal_receiver_timeout, wal_receiver_status_interval and wal_retrieve_retry_interval configuration parameters affect the logical replication workers as well.
max_logical_replication_workers (integer)
Specifies the maximum number of logical replication workers. This includes both apply workers and table synchronization workers.
Logical replication workers are taken from the pool defined by max_worker_processes.
The default value is 4. This parameter can only be set at server start.
max_sync_workers_per_subscription (integer)
Maximum number of synchronization workers per subscription. This parameter controls the amount of parallelism of the initial data copy during the subscription initialization or when new tables are added.
Currently, there can be only one synchronization worker per table.
The synchronization workers are taken from the pool defined by max_logical_replication_workers.
The default value is 2. This parameter can only be set in the postgresql.conf file or on the server command line.
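As an illustrative sketch (the values are assumptions, not recommendations) for a subscriber that should synchronize several tables in parallel; note that all of these workers come out of the max_worker_processes pool:
max_worker_processes = 16
max_logical_replication_workers = 8
max_sync_workers_per_subscription = 4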
These configuration parameters provide a crude method of
influencing the query plans chosen by the query optimizer. If
the default plan chosen by the optimizer for a particular query
is not optimal, a temporary solution is to use one
of these configuration parameters to force the optimizer to
choose a different plan.
Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 20.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.
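For example, to test whether a different plan type would help, a common experiment is to discourage the currently chosen method for a single transaction and compare the EXPLAIN output; the table and column names here are hypothetical:
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
ROLLBACK;
Because SET LOCAL only lasts until the end of the transaction, the production setting is left untouched.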
enable_async_append (boolean)
Enables or disables the query planner's use of async-aware append plan types. The default is on.
enable_bitmapscan (boolean)
Enables or disables the query planner's use of bitmap-scan plan types. The default is on.
enable_gathermerge (boolean)
Enables or disables the query planner's use of gather merge plan types. The default is on.
enable_hashagg (boolean)
Enables or disables the query planner's use of hashed aggregation plan types. The default is on.
enable_hashjoin (boolean)
Enables or disables the query planner's use of hash-join plan types. The default is on.
enable_incremental_sort (boolean)
Enables or disables the query planner's use of incremental sort steps. The default is on.
enable_indexscan (boolean)
Enables or disables the query planner's use of index-scan plan types. The default is on.
enable_indexonlyscan (boolean)
Enables or disables the query planner's use of index-only-scan plan types (see Section 11.9). The default is on.
enable_material (boolean)
Enables or disables the query planner's use of materialization. It is impossible to suppress materialization entirely, but turning this variable off prevents the planner from inserting materialize nodes except in cases where it is required for correctness. The default is on.
enable_memoize (boolean)
Enables or disables the query planner's use of memoize plans for caching results from parameterized scans inside nested-loop joins. This plan type allows scans to the underlying plans to be skipped when the results for the current parameters are already in the cache. Less commonly looked up results may be evicted from the cache when more space is required for new entries. The default is on.
enable_mergejoin (boolean)
Enables or disables the query planner's use of merge-join plan types. The default is on.
enable_nestloop (boolean)
Enables or disables the query planner's use of nested-loop join plans. It is impossible to suppress nested-loop joins entirely, but turning this variable off discourages the planner from using one if there are other methods available. The default is on.
enable_parallel_append (boolean)
Enables or disables the query planner's use of parallel-aware append plan types. The default is on.
enable_parallel_hash (boolean)
Enables or disables the query planner's use of hash-join plan types with parallel hash. Has no effect if hash-join plans are not also enabled. The default is on.
enable_partition_pruning (boolean)
Enables or disables the query planner's ability to eliminate a partitioned table's partitions from query plans. This also controls the planner's ability to generate query plans which allow the query executor to remove (ignore) partitions during query execution. The default is on. See Section 5.11.4 for details.
enable_partitionwise_join (boolean)
Enables or disables the query planner's use of partitionwise join, which allows a join between partitioned tables to be performed by joining the matching partitions. Partitionwise join currently applies only when the join conditions include all the partition keys, which must be of the same data type and have one-to-one matching sets of child partitions. With this setting enabled, the number of nodes whose memory usage is restricted by work_mem appearing in the final plan can increase linearly according to the number of partitions being scanned. This can result in a large increase in overall memory consumption during the execution of the query. Query planning also becomes significantly more expensive in terms of memory and CPU. The default value is off.
enable_partitionwise_aggregate (boolean)
Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows grouping or aggregation on partitioned tables to be performed separately for each partition. If the GROUP BY clause does not include the partition keys, only partial aggregation can be performed on a per-partition basis, and finalization must be performed later. With this setting enabled, the number of nodes whose memory usage is restricted by work_mem appearing in the final plan can increase linearly according to the number of partitions being scanned. This can result in a large increase in overall memory consumption during the execution of the query. Query planning also becomes significantly more expensive in terms of memory and CPU. The default value is off.
enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan types. It is impossible to suppress sequential scans entirely, but turning this variable off discourages the planner from using one if there are other methods available. The default is on.
enable_sort (boolean)
Enables or disables the query planner's use of explicit sort steps. It is impossible to suppress explicit sorts entirely, but turning this variable off discourages the planner from using one if there are other methods available. The default is on.
enable_tidscan (boolean)
Enables or disables the query planner's use of TID scan plan types. The default is on.
The cost variables described in this section are measured
on an arbitrary scale. Only their relative values matter, hence
scaling them all up or down by the same factor will result in no change
in the planner's choices. By default, these cost variables are based on
the cost of sequential page fetches; that is,
seq_page_cost
is conventionally set to 1.0
and the other cost variables are set with reference to that. But
you can use a different scale if you prefer, such as actual execution
times in milliseconds on a particular machine.
Unfortunately, there is no well-defined method for determining ideal values for the cost variables. They are best treated as averages over the entire mix of queries that a particular installation will receive. This means that changing them on the basis of just a few experiments is very risky.
seq_page_cost (floating point)
Sets the planner's estimate of the cost of a disk page fetch that is part of a series of sequential fetches. The default is 1.0. This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
random_page_cost (floating point)
Sets the planner's estimate of the cost of a non-sequentially-fetched disk page. The default is 4.0. This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ALTER TABLESPACE).
Reducing this value relative to seq_page_cost
will cause the system to prefer index scans; raising it will
make index scans look relatively more expensive. You can raise
or lower both values together to change the importance of disk I/O
costs relative to CPU costs, which are described by the following
parameters.
Random access to mechanical disk storage is normally much more expensive than four times sequential access. However, a lower default is used (4.0) because the majority of random accesses to disk, such as indexed reads, are assumed to be in cache. The default value can be thought of as modeling random access as 40 times slower than sequential, while expecting 90% of random reads to be cached.
If you believe a 90% cache rate is an incorrect assumption
for your workload, you can increase random_page_cost to better
reflect the true cost of random storage reads. Correspondingly,
if your data is likely to be completely in cache, such as when
the database is smaller than the total server memory, decreasing
random_page_cost can be appropriate. Storage that has a low random
read cost relative to sequential, e.g., solid-state drives, might
also be better modeled with a lower value for random_page_cost,
e.g., 1.1.
Although the system will let you set random_page_cost to less than seq_page_cost, it is not physically sensible
to do so. However, setting them equal makes sense if the database
is entirely cached in RAM, since in that case there is no penalty
for touching pages out of sequence. Also, in a heavily-cached
database you should lower both values relative to the CPU parameters,
since the cost of fetching a page already in RAM is much smaller
than it would normally be.
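For instance (a hedged sketch; the tablespace name is hypothetical and 1.1 is only a commonly used starting point for SSD-backed storage), the value can be lowered either cluster-wide or for a single tablespace:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);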
cpu_tuple_cost (floating point)
Sets the planner's estimate of the cost of processing each row during a query. The default is 0.01.
cpu_index_tuple_cost (floating point)
Sets the planner's estimate of the cost of processing each index entry during an index scan. The default is 0.005.
cpu_operator_cost (floating point)
Sets the planner's estimate of the cost of processing each operator or function executed during a query. The default is 0.0025.
parallel_setup_cost (floating point)
Sets the planner's estimate of the cost of launching parallel worker processes. The default is 1000.
parallel_tuple_cost (floating point)
Sets the planner's estimate of the cost of transferring one tuple from a parallel worker process to another process. The default is 0.1.
min_parallel_table_scan_size (integer)
Sets the minimum amount of table data that must be scanned in order for a parallel scan to be considered. For a parallel sequential scan, the amount of table data scanned is always equal to the size of the table, but when indexes are used the amount of table data scanned will normally be less.
If this value is specified without units, it is taken as blocks, that is BLCKSZ bytes, typically 8kB.
The default is 8 megabytes (8MB).
min_parallel_index_scan_size (integer)
Sets the minimum amount of index data that must be scanned in order for a parallel scan to be considered. Note that a parallel index scan typically won't touch the entire index; it is the number of pages which the planner believes will actually be touched by the scan which is relevant. This parameter is also used to decide whether a particular index can participate in a parallel vacuum. See VACUUM.
If this value is specified without units, it is taken as blocks, that is BLCKSZ bytes, typically 8kB.
The default is 512 kilobytes (512kB).
effective_cache_size (integer)
Sets the planner's assumption about the effective size of the
disk cache that is available to a single query. This is
factored into estimates of the cost of using an index; a
higher value makes it more likely index scans will be used, a
lower value makes it more likely sequential scans will be
used. When setting this parameter you should consider both
PostgreSQL's shared buffers and the
portion of the kernel's disk cache that will be used for
PostgreSQL data files, though some
data might exist in both places. Also, take
into account the expected number of concurrent queries on different
tables, since they will have to share the available
space. This parameter has no effect on the size of shared
memory allocated by PostgreSQL, nor
does it reserve kernel disk cache; it is used only for estimation
purposes. The system also does not assume data remains in
the disk cache between queries.
If this value is specified without units, it is taken as blocks, that is BLCKSZ bytes, typically 8kB.
The default is 4 gigabytes (4GB). (If BLCKSZ is not 8kB, the default value scales proportionally to it.)
jit_above_cost (floating point)
Sets the query cost above which JIT compilation is activated, if enabled (see Chapter 32). Performing JIT costs planning time but can accelerate query execution. Setting this to -1 disables JIT compilation. The default is 100000.
jit_inline_above_cost (floating point)
Sets the query cost above which JIT compilation attempts to inline functions and operators. Inlining adds planning time, but can improve execution speed. It is not meaningful to set this to less than jit_above_cost. Setting this to -1 disables inlining. The default is 500000.
jit_optimize_above_cost (floating point)
Sets the query cost above which JIT compilation applies expensive optimizations. Such optimization adds planning time, but can improve execution speed. It is not meaningful to set this to less than jit_above_cost, and it is unlikely to be beneficial to set it to more than jit_inline_above_cost. Setting this to -1 disables expensive optimizations. The default is 500000.
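To check whether a particular query crosses these thresholds, a hedged experiment is to lower jit_above_cost for one session and run the query under EXPLAIN (ANALYZE), whose output includes a JIT section when compilation was actually used; the table and column names are hypothetical:
SET jit = on;
SET jit_above_cost = 10000;
EXPLAIN (ANALYZE) SELECT sum(amount) FROM orders;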
The genetic query optimizer (GEQO) is an algorithm that does query planning using heuristic searching. This reduces planning time for complex queries (those joining many relations), at the cost of producing plans that are sometimes inferior to those found by the normal exhaustive-search algorithm. For more information see Chapter 60.
geqo (boolean)
Enables or disables genetic query optimization.
This is on by default. It is usually best not to turn it off in
production; the geqo_threshold
variable provides
more granular control of GEQO.
geqo_threshold (integer)
Use genetic query optimization to plan queries with at least
this many FROM
items involved. (Note that a
FULL OUTER JOIN
construct counts as only one FROM
item.) The default is 12. For simpler queries it is usually best
to use the regular, exhaustive-search planner, but for queries with
many tables the exhaustive search takes too long, often
longer than the penalty of executing a suboptimal plan. Thus,
a threshold on the size of the query is a convenient way to manage
use of GEQO.
geqo_effort (integer)
Controls the trade-off between planning time and query plan quality in GEQO. This variable must be an integer in the range from 1 to 10. The default value is five. Larger values increase the time spent doing query planning, but also increase the likelihood that an efficient query plan will be chosen.
geqo_effort
doesn't actually do anything
directly; it is only used to compute the default values for
the other variables that influence GEQO behavior (described
below). If you prefer, you can set the other parameters by
hand instead.
geqo_pool_size (integer)
Controls the pool size used by GEQO, that is the number of individuals in the genetic population. It must be at least two, and useful values are typically 100 to 1000. If it is set to zero (the default setting) then a suitable value is chosen based on geqo_effort and the number of tables in the query.
geqo_generations (integer)
Controls the number of generations used by GEQO, that is the number of iterations of the algorithm. It must be at least one, and useful values are in the same range as the pool size. If it is set to zero (the default setting) then a suitable value is chosen based on geqo_pool_size.
geqo_selection_bias (floating point)
Controls the selection bias used by GEQO. The selection bias is the selective pressure within the population. Values can be from 1.50 to 2.00; the latter is the default.
geqo_seed (floating point)
Controls the initial value of the random number generator used by GEQO to select random paths through the join order search space. The value can range from zero (the default) to one. Varying the value changes the set of join paths explored, and may result in a better or worse best path being found.
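A hedged postgresql.conf sketch; the values are illustrative and simply keep the exhaustive planner in use for somewhat larger join problems before GEQO takes over:
geqo = on
geqo_threshold = 14
geqo_effort = 5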
default_statistics_target (integer)
Sets the default statistics target for table columns without a column-specific target set via ALTER TABLE SET STATISTICS. Larger values increase the time needed to do ANALYZE, but might improve the quality of the planner's estimates. The default is 100. For more information on the use of statistics by the PostgreSQL query planner, refer to Section 14.2.
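For a single skewed column it is often cheaper to raise the target per column than globally; a minimal sketch with hypothetical table and column names:
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;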
constraint_exclusion (enum)
Controls the query planner's use of table constraints to optimize queries.
The allowed values of constraint_exclusion are on (examine constraints for all tables), off (never examine constraints), and partition (examine constraints only for inheritance child tables and UNION ALL subqueries). partition is the default setting. It is often used with traditional inheritance trees to improve performance.
When this parameter allows it for a particular table, the planner
compares query conditions with the table's CHECK
constraints, and omits scanning tables for which the conditions
contradict the constraints. For example:
CREATE TABLE parent(key integer, ...);
CREATE TABLE child1000(check (key between 1000 and 1999)) INHERITS(parent);
CREATE TABLE child2000(check (key between 2000 and 2999)) INHERITS(parent);
...
SELECT * FROM parent WHERE key = 2400;
With constraint exclusion enabled, this SELECT
will not scan child1000
at all, improving performance.
Currently, constraint exclusion is enabled by default only for cases that are often used to implement table partitioning via inheritance trees. Turning it on for all tables imposes extra planning overhead that is quite noticeable on simple queries, and most often will yield no benefit for simple queries. If you have no tables that are partitioned using traditional inheritance, you might prefer to turn it off entirely. (Note that the equivalent feature for partitioned tables is controlled by a separate parameter, enable_partition_pruning.)
Refer to Section 5.11.5 for more information on using constraint exclusion to implement partitioning.
cursor_tuple_fraction (floating point)
Sets the planner's estimate of the fraction of a cursor's rows that will be retrieved. The default is 0.1. Smaller values of this setting bias the planner towards using “fast start” plans for cursors, which will retrieve the first few rows quickly while perhaps taking a long time to fetch all rows. Larger values put more emphasis on the total estimated time. At the maximum setting of 1.0, cursors are planned exactly like regular queries, considering only the total estimated time and not how soon the first rows might be delivered.
from_collapse_limit (integer)
The planner will merge sub-queries into upper queries if the
resulting FROM
list would have no more than
this many items. Smaller values reduce planning time but might
yield inferior query plans. The default is eight.
For more information see Section 14.3.
Setting this value to geqo_threshold or more may trigger use of the GEQO planner, resulting in non-optimal plans. See Section 20.7.3.
jit (boolean)
Determines whether JIT compilation may be used by PostgreSQL, if available (see Chapter 32). The default is on.
join_collapse_limit (integer)
The planner will rewrite explicit JOIN constructs (except FULL JOINs) into lists of FROM items whenever a list of no more than this many items would result. Smaller values reduce planning time but might yield inferior query plans.
By default, this variable is set the same as
from_collapse_limit
, which is appropriate
for most uses. Setting it to 1 prevents any reordering of
explicit JOIN
s. Thus, the explicit join order
specified in the query will be the actual order in which the
relations are joined. Because the query planner does not always choose
the optimal join order, advanced users can elect to
temporarily set this variable to 1, and then specify the join
order they desire explicitly.
For more information see Section 14.3.
Setting this value to geqo_threshold or more may trigger use of the GEQO planner, resulting in non-optimal plans. See Section 20.7.3.
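As a sketch of the technique described above (table names are hypothetical), the written join order can be enforced for one session like this:
SET join_collapse_limit = 1;
SELECT a.id
FROM a
JOIN b ON b.a_id = a.id
JOIN c ON c.b_id = b.id;
With the limit at 1, the relations are joined exactly in the order written: (a JOIN b) JOIN c.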
plan_cache_mode (enum)
Prepared statements (either explicitly prepared or implicitly
generated, for example by PL/pgSQL) can be executed using custom or
generic plans. Custom plans are made afresh for each execution
using its specific set of parameter values, while generic plans do
not rely on the parameter values and can be re-used across
executions. Thus, use of a generic plan saves planning time, but if
the ideal plan depends strongly on the parameter values then a
generic plan may be inefficient. The choice between these options
is normally made automatically, but it can be overridden with plan_cache_mode.
The allowed values are auto (the default), force_custom_plan and force_generic_plan.
This setting is considered when a cached plan is to be executed,
not when it is prepared.
For more information see PREPARE.
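A minimal sketch of overriding the automatic choice for one session (the statement and parameter are hypothetical):
SET plan_cache_mode = force_custom_plan;
PREPARE by_customer(int) AS SELECT count(*) FROM orders WHERE customer_id = $1;
EXECUTE by_customer(42);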
log_destination (string)
PostgreSQL supports several methods
for logging server messages, including
stderr, csvlog and
syslog. On Windows,
eventlog is also supported. Set this
parameter to a list of desired log destinations separated by
commas. The default is to log to stderr
only.
This parameter can only be set in the postgresql.conf
file or on the server command line.
If csvlog is included in log_destination, log entries are output in “comma separated value” (CSV) format, which is convenient for loading logs into programs.
See Section 20.8.4 for details.
logging_collector must be enabled to generate
CSV-format log output.
When either stderr or
csvlog are included, the file
current_logfiles
is created to record the location
of the log file(s) currently in use by the logging collector and the
associated logging destination. This provides a convenient way to
find the logs currently in use by the instance. Here is an example of
this file's content:
stderr log/postgresql.log
csvlog log/postgresql.csv
current_logfiles is recreated when a new log file is created as an effect of rotation, and when log_destination is reloaded. It is removed when neither stderr nor csvlog are included in log_destination, and when the logging collector is disabled.
On most Unix systems, you will need to alter the configuration of
your system's syslog daemon in order
to make use of the syslog option for
log_destination
. PostgreSQL
can log to syslog facilities
LOCAL0
through LOCAL7
(see syslog_facility), but the default
syslog configuration on most platforms
will discard all such messages. You will need to add something like:
local0.* /var/log/postgresql
to the syslog daemon's configuration file to make it work.
On Windows, when you use the eventlog option for log_destination, you should register an event source and its library with the operating system so that the Windows Event Viewer can display event log messages cleanly. See Section 19.12 for details.
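Putting the pieces together, a hedged postgresql.conf sketch that sends both plain-text and CSV output through the logging collector described below:
log_destination = 'stderr,csvlog'
logging_collector = on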
logging_collector (boolean)
This parameter enables the logging collector, which
is a background process that captures log messages
sent to stderr and redirects them into log files.
This approach is often more useful than
logging to syslog, since some types of messages
might not appear in syslog output. (One common
example is dynamic-linker failure messages; another is error messages
produced by scripts such as archive_command.)
This parameter can only be set at server start.
It is possible to log to stderr without using the logging collector; the log messages will just go to wherever the server's stderr is directed. However, that method is only suitable for low log volumes, since it provides no convenient way to rotate log files. Also, on some platforms not using the logging collector can result in lost or garbled log output, because multiple processes writing concurrently to the same log file can overwrite each other's output.
The logging collector is designed to never lose messages. This means that in case of extremely high load, server processes could be blocked while trying to send additional log messages when the collector has fallen behind. In contrast, syslog prefers to drop messages if it cannot write them, which means it may fail to log some messages in such cases but it will not block the rest of the system.
log_directory (string)
When logging_collector is enabled, this parameter determines the directory in which log files will be created. It can be specified as an absolute path, or relative to the cluster data directory. This parameter can only be set in the postgresql.conf file or on the server command line. The default is log.
log_filename (string)
When logging_collector is enabled, this parameter sets the file names of the created log files. The value is treated as a strftime pattern, so %-escapes can be used to specify time-varying file names. (Note that if there are any time-zone-dependent %-escapes, the computation is done in the zone specified by log_timezone.) The supported %-escapes are similar to those listed in the Open Group's strftime specification. Note that the system's strftime is not used directly, so platform-specific (nonstandard) extensions do not work. The default is postgresql-%Y-%m-%d_%H%M%S.log.
If you specify a file name without escapes, you should plan to use a log rotation utility to avoid eventually filling the entire disk. In releases prior to 8.4, if no % escapes were present, PostgreSQL would append the epoch of the new log file's creation time, but this is no longer the case.
If CSV-format output is enabled in log_destination, .csv will be appended to the timestamped log file name to create the file name for CSV-format output. (If log_filename ends in .log, the suffix is replaced instead.)
This parameter can only be set in the postgresql.conf file or on the server command line.
log_file_mode (integer)
On Unix systems this parameter sets the permissions for log files when logging_collector is enabled. (On Microsoft Windows this parameter is ignored.) The parameter value is expected to be a numeric mode specified in the format accepted by the chmod and umask system calls. (To use the customary octal format the number must start with a 0 (zero).)
The default permissions are 0600, meaning only the server owner can read or write the log files. The other commonly useful setting is 0640, allowing members of the owner's group to read the files. Note however that to make use of such a setting, you'll need to alter log_directory to store the files somewhere outside the cluster data directory. In any case, it's unwise to make the log files world-readable, since they might contain sensitive data.
This parameter can only be set in the postgresql.conf file or on the server command line.
log_rotation_age (integer)
When logging_collector
is enabled,
this parameter determines the maximum amount of time to use an
individual log file, after which a new log file will be created.
If this value is specified without units, it is taken as minutes.
The default is 24 hours.
Set to zero to disable time-based creation of new log files.
This parameter can only be set in the postgresql.conf
file or on the server command line.
log_rotation_size (integer)
When logging_collector
is enabled,
this parameter determines the maximum size of an individual log file.
After this amount of data has been emitted into a log file,
a new log file will be created.
If this value is specified without units, it is taken as kilobytes.
The default is 10 megabytes.
Set to zero to disable size-based creation of new log files.
This parameter can only be set in the postgresql.conf
file or on the server command line.
log_truncate_on_rotation (boolean)
When logging_collector
is enabled,
this parameter will cause PostgreSQL to truncate (overwrite),
rather than append to, any existing log file of the same name.
However, truncation will occur only when a new file is being opened
due to time-based rotation, not during server startup or size-based
rotation. When off, pre-existing files will be appended to in
all cases. For example, using this setting in combination with
a log_filename
like postgresql-%H.log
would result in generating twenty-four hourly log files and then
cyclically overwriting them.
This parameter can only be set in the postgresql.conf
file or on the server command line.
Example: To keep 7 days of logs, one log file per day named server_log.Mon, server_log.Tue, etc, and automatically overwrite last week's log with this week's log, set log_filename to server_log.%a, log_truncate_on_rotation to on, and log_rotation_age to 1440.
Example: To keep 24 hours of logs, one log file per hour, but also rotate sooner if the log file size exceeds 1GB, set log_filename to server_log.%H%M, log_truncate_on_rotation to on, log_rotation_age to 60, and log_rotation_size to 1000000. Including %M in log_filename allows any size-driven rotations that might occur to select a file name different from the hour's initial file name.
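The first example above would look roughly like this in postgresql.conf (a sketch; 1440 minutes can equivalently be written as 1d):
log_filename = 'server_log.%a'
log_truncate_on_rotation = on
log_rotation_age = 1440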
syslog_facility (enum)
When logging to syslog is enabled, this parameter determines the syslog “facility” to be used. You can choose from LOCAL0, LOCAL1, LOCAL2, LOCAL3, LOCAL4, LOCAL5, LOCAL6, LOCAL7; the default is LOCAL0. See also the documentation of your system's syslog daemon.
This parameter can only be set in the postgresql.conf
file or on the server command line.
syslog_ident (string)
When logging to syslog is enabled, this parameter determines the program name used to identify PostgreSQL messages in syslog logs. The default is postgres.
This parameter can only be set in the postgresql.conf
file or on the server command line.
syslog_sequence_numbers (boolean)
When logging to syslog and this is on (the default), then each message will be prefixed by an increasing sequence number (such as [2]). This circumvents the “--- last message repeated N times ---” suppression that many syslog implementations perform by default. In more modern syslog implementations, repeated message suppression can be configured (for example, $RepeatedMsgReduction in rsyslog), so this might not be necessary. Also, you could turn this off if you actually want to suppress repeated messages.
This parameter can only be set in the postgresql.conf
file or on the server command line.
syslog_split_messages (boolean)
When logging to syslog is enabled, this parameter determines how messages are delivered to syslog. When on (the default), messages are split by lines, and long lines are split so that they will fit into 1024 bytes, which is a typical size limit for traditional syslog implementations. When off, PostgreSQL server log messages are delivered to the syslog service as is, and it is up to the syslog service to cope with the potentially bulky messages.
If syslog is ultimately logging to a text file, then the effect will be the same either way, and it is best to leave the setting on, since most syslog implementations either cannot handle large messages or would need to be specially configured to handle them. But if syslog is ultimately writing into some other medium, it might be necessary or more useful to keep messages logically together.
This parameter can only be set in the postgresql.conf
file or on the server command line.
event_source (string)
When logging to event log is enabled, this parameter determines the program name used to identify PostgreSQL messages in the log. The default is PostgreSQL.
This parameter can only be set in the postgresql.conf
file or on the server command line.
log_min_messages (enum)
Controls which message levels are written to the server log. Valid values are DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1, INFO, NOTICE, WARNING, ERROR, LOG, FATAL, and PANIC. Each level includes all the levels that follow it. The later the level, the fewer messages are sent to the log. The default is WARNING. Note that LOG has a different rank here than in client_min_messages.
Only superusers can change this setting.
log_min_error_statement (enum)
Controls which SQL statements that cause an error condition are recorded in the server log. The current SQL statement is included in the log entry for any message of the specified severity or higher. Valid values are DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1, INFO, NOTICE, WARNING, ERROR, LOG, FATAL, and PANIC. The default is ERROR, which means statements causing errors, log messages, fatal errors, or panics will be logged. To effectively turn off logging of failing statements, set this parameter to PANIC.
Only superusers can change this setting.
log_min_duration_statement (integer)
Causes the duration of each completed statement to be logged
if the statement ran for at least the specified amount of time.
For example, if you set it to 250ms
then all SQL statements that run 250ms or longer will be
logged. Enabling this parameter can be helpful in tracking down
unoptimized queries in your applications.
If this value is specified without units, it is taken as milliseconds.
Setting this to zero prints all statement durations. -1 (the default) disables logging statement durations. Only superusers can change this setting.
This overrides log_min_duration_sample, meaning that queries with duration exceeding this setting are not subject to sampling and are always logged.
For clients using extended query protocol, durations of the Parse, Bind, and Execute steps are logged independently.
When using this option together with
log_statement,
the text of statements that are logged because of
log_statement
will not be repeated in the
duration log message.
If you are not using syslog, it is recommended
that you log the PID or session ID using
log_line_prefix
so that you can link the statement message to the later
duration message using the process ID or session ID.
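A hedged postgresql.conf sketch that logs every statement taking 250 milliseconds or longer and puts the process ID in the line prefix so that duration lines can be matched to their statement lines:
log_min_duration_statement = 250ms
log_line_prefix = '%m [%p] '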
log_min_duration_sample (integer)
Allows sampling the duration of completed statements that ran for
at least the specified amount of time. This produces the same
kind of log entries as
log_min_duration_statement, but only for a
subset of the executed statements, with sample rate controlled by
log_statement_sample_rate.
For example, if you set it to 100ms
then all
SQL statements that run 100ms or longer will be considered for
sampling. Enabling this parameter can be helpful when the
traffic is too high to log all queries.
If this value is specified without units, it is taken as milliseconds.
Setting this to zero samples all statement durations. -1 (the default) disables sampling statement durations. Only superusers can change this setting.
This setting has lower priority than log_min_duration_statement, meaning that statements with durations exceeding log_min_duration_statement are not subject to sampling and are always logged. Other notes for log_min_duration_statement apply also to this setting.
log_statement_sample_rate (floating point)
Determines the fraction of statements with duration exceeding log_min_duration_sample that will be logged. Sampling is stochastic, for example 0.5 means there is statistically one chance in two that any given statement will be logged. The default is 1.0, meaning to log all sampled statements. Setting this to zero disables sampled statement-duration logging, the same as setting log_min_duration_sample to -1.
Only superusers can change this setting.
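For example, a sketch (the values are illustrative) that always logs statements over one second while sampling only ten percent of those between 100 ms and one second:
log_min_duration_statement = 1s
log_min_duration_sample = 100ms
log_statement_sample_rate = 0.10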
log_transaction_sample_rate (floating point)
Sets the fraction of transactions whose statements are all logged, in addition to statements logged for other reasons. It applies to each new transaction regardless of its statements' durations. Sampling is stochastic, for example 0.1 means there is statistically one chance in ten that any given transaction will be logged. log_transaction_sample_rate can be helpful to construct a sample of transactions. The default is 0, meaning not to log statements from any additional transactions. Setting this to 1 logs all statements of all transactions.
Only superusers can change this setting.
Like all statement-logging options, this option can add significant overhead.
Table 20.2 explains the message severity levels used by PostgreSQL. If logging output is sent to syslog or Windows' eventlog, the severity levels are translated as shown in the table.
Table 20.2. Message Severity Levels
Severity | Usage | syslog | eventlog |
---|---|---|---|
DEBUG1 .. DEBUG5 | Provides successively-more-detailed information for use by developers. | DEBUG | INFORMATION |
INFO | Provides information implicitly requested by the user, e.g., output from VACUUM VERBOSE . | INFO | INFORMATION |
NOTICE | Provides information that might be helpful to users, e.g., notice of truncation of long identifiers. | NOTICE | INFORMATION |
WARNING | Provides warnings of likely problems, e.g., COMMIT outside a transaction block. | NOTICE | WARNING |
ERROR | Reports an error that caused the current command to abort. | WARNING | ERROR |
LOG | Reports information of interest to administrators, e.g., checkpoint activity. | INFO | INFORMATION |
FATAL | Reports an error that caused the current session to abort. | ERR | ERROR |
PANIC | Reports an error that caused all database sessions to abort. | CRIT | ERROR |
What you choose to log can have security implications; see Section 25.3.
application_name (string)
The application_name
can be any string of less than
NAMEDATALEN
characters (64 characters in a standard build).
It is typically set by an application upon connection to the server.
The name will be displayed in the pg_stat_activity
view
and included in CSV log entries. It can also be included in regular
log entries via the log_line_prefix parameter.
Only printable ASCII characters may be used in the application_name value. Other characters will be replaced with question marks (?).
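A minimal sketch of the usual pattern: the client sets the name for its session (here via SQL) and it then appears in pg_stat_activity and, with %a in log_line_prefix, in the log; the name itself is hypothetical:
SET application_name = 'nightly-report';
SELECT pid, application_name, state FROM pg_stat_activity;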
debug_print_parse (boolean)
debug_print_rewritten (boolean)
debug_print_plan (boolean)
These parameters enable various debugging output to be emitted.
When set, they print the resulting parse tree, the query rewriter
output, or the execution plan for each executed query.
These messages are emitted at LOG
message level, so by
default they will appear in the server log but will not be sent to the
client. You can change that by adjusting
client_min_messages and/or
log_min_messages.
These parameters are off by default.
debug_pretty_print (boolean)
When set, debug_pretty_print indents the messages produced by debug_print_parse, debug_print_rewritten, or debug_print_plan. This results in more readable but much longer output than the “compact” format used when it is off. It is on by default.
log_autovacuum_min_duration (integer)
Causes each action executed by autovacuum to be logged if it ran for at
least the specified amount of time. Setting this to zero logs
all autovacuum actions. -1
(the default) disables
logging autovacuum actions.
If this value is specified without units, it is taken as milliseconds.
For example, if you set this to 250ms then all automatic vacuums and analyzes that run 250ms or longer will be logged. In addition, when this parameter is set to any value other than -1, a message will be
logged if an autovacuum action is skipped due to a conflicting lock or a
concurrently dropped relation. Enabling this parameter can be helpful
in tracking autovacuum activity. This parameter can only be set in
the postgresql.conf
file or on the server command line;
but the setting can be overridden for individual tables by
changing table storage parameters.
log_checkpoints (boolean)
Causes checkpoints and restartpoints to be logged in the server log.
Some statistics are included in the log messages, including the number
of buffers written and the time spent writing them.
This parameter can only be set in the postgresql.conf
file or on the server command line. The default is off.
log_connections (boolean)
Causes each attempted connection to the server to be logged, as well as successful completion of both client authentication (if necessary) and authorization. Only superusers can change this parameter at session start, and it cannot be changed at all within a session. The default is off.
Some client programs, like psql, attempt to connect twice while determining if a password is required, so duplicate “connection received” messages do not necessarily indicate a problem.
log_disconnections (boolean)
Causes session terminations to be logged. The log output provides information similar to log_connections, plus the duration of the session. Only superusers can change this parameter at session start, and it cannot be changed at all within a session. The default is off.
log_duration (boolean)
Causes the duration of every completed statement to be logged. The default is off. Only superusers can change this setting.
For clients using extended query protocol, durations of the Parse, Bind, and Execute steps are logged independently.
The difference between enabling log_duration and setting log_min_duration_statement to zero is that exceeding log_min_duration_statement forces the text of the query to be logged, but this option doesn't. Thus, if log_duration is on and log_min_duration_statement has a positive value, all durations are logged but the query text is included only for statements exceeding the threshold. This behavior can be useful for gathering statistics in high-load installations.
log_error_verbosity (enum)
Controls the amount of detail written in the server log for each message that is logged. Valid values are TERSE, DEFAULT, and VERBOSE, each adding more fields to displayed messages. TERSE excludes the logging of DETAIL, HINT, QUERY, and CONTEXT error information. VERBOSE output includes the SQLSTATE error code (see also Appendix A) and the source code file name, function name, and line number that generated the error.
Only superusers can change this setting.
log_hostname (boolean)
By default, connection log messages only show the IP address of the
connecting host. Turning this parameter on causes logging of the
host name as well. Note that depending on your host name resolution
setup this might impose a non-negligible performance penalty.
This parameter can only be set in the postgresql.conf
file or on the server command line.
log_line_prefix (string)
This is a printf-style string that is output at the beginning of each log line. % characters begin “escape sequences” that are replaced with status information as outlined below.
Unrecognized escapes are ignored. Other
characters are copied straight to the log line. Some escapes are
only recognized by session processes, and will be treated as empty by
background processes such as the main server process. Status
information may be aligned either left or right by specifying a
numeric literal after the % and before the option. A negative
value will cause the status information to be padded on the
right with spaces to give it a minimum width, whereas a positive
value will pad on the left. Padding can be useful to aid human
readability in log files.
This parameter can only be set in the postgresql.conf
file or on the server command line. The default is
'%m [%p] '
which logs a time stamp and the process ID.
Escape | Effect | Session only |
---|---|---|
%a | Application name | yes |
%u | User name | yes |
%d | Database name | yes |
%r | Remote host name or IP address, and remote port | yes |
%h | Remote host name or IP address | yes |
%b | Backend type | no |
%p | Process ID | no |
%P | Process ID of the parallel group leader, if this process is a parallel query worker | no |
%t | Time stamp without milliseconds | no |
%m | Time stamp with milliseconds | no |
%n | Time stamp with milliseconds (as a Unix epoch) | no |
%i | Command tag: type of session's current command | yes |
%e | SQLSTATE error code | no |
%c | Session ID: see below | no |
%l | Number of the log line for each session or process, starting at 1 | no |
%s | Process start time stamp | no |
%v | Virtual transaction ID (backendID/localXID) | no |
%x | Transaction ID (0 if none is assigned) | no |
%q | Produces no output, but tells non-session processes to stop at this point in the string; ignored by session processes | no |
%Q | Query identifier of the current query. Query identifiers are not computed by default, so this field will be zero unless compute_query_id parameter is enabled or a third-party module that computes query identifiers is configured. | yes |
%% | Literal % | no |
The backend type corresponds to the column backend_type in the view pg_stat_activity, but additional types can appear in the log that don't show in that view.
The %c escape prints a quasi-unique session identifier, consisting of two 4-byte hexadecimal numbers (without leading zeros) separated by a dot. The numbers are the process start time and the process ID, so %c can also be used as a space saving way of printing those items. For example, to generate the session identifier from pg_stat_activity, use this query:
SELECT to_hex(trunc(EXTRACT(EPOCH FROM backend_start))::integer) || '.' ||
       to_hex(pid)
FROM pg_stat_activity;
If you set a nonempty value for log_line_prefix, you should usually make its last character be a space, to provide visual separation from the rest of the log line. A punctuation character can be used too.
Syslog produces its own time stamp and process ID information, so you probably do not want to include those escapes if you are logging to syslog.
The %q escape is useful when including information that is only available in session (backend) context like user or database name. For example:
log_line_prefix = '%m [%p] %q%u@%d/%a '
The %Q escape always reports a zero identifier for lines output by log_statement because log_statement generates output before an identifier can be calculated, including invalid statements for which an identifier cannot be calculated.
log_lock_waits (boolean)
Controls whether a log message is produced when a session waits longer than deadlock_timeout to acquire a lock. This is useful in determining if lock waits are causing poor performance. The default is off.
Only superusers can change this setting.
log_recovery_conflict_waits (boolean)
Controls whether a log message is produced when the startup process waits longer than deadlock_timeout for recovery conflicts. This is useful in determining if recovery conflicts prevent the recovery from applying WAL. The default is off. This parameter can only be set in the postgresql.conf file or on the server command line.
log_parameter_max_length (integer)
If greater than zero, each bind parameter value logged with a non-error statement-logging message is trimmed to this many bytes. Zero disables logging of bind parameters for non-error statement logs. -1 (the default) allows bind parameters to be logged in full.
If this value is specified without units, it is taken as bytes.
Only superusers can change this setting.
This setting only affects log messages printed as a result of log_statement, log_duration, and related settings. Non-zero values of this setting add some overhead, particularly if parameters are sent in binary form, since then conversion to text is required.
log_parameter_max_length_on_error (integer)
If greater than zero, each bind parameter value reported in error messages is trimmed to this many bytes. Zero (the default) disables including bind parameters in error messages. -1 allows bind parameters to be printed in full.
If this value is specified without units, it is taken as bytes.
Non-zero values of this setting add overhead, as PostgreSQL will need to store textual representations of parameter values in memory at the start of each statement, whether or not an error eventually occurs. The overhead is greater when bind parameters are sent in binary form than when they are sent as text, since the former case requires data conversion while the latter only requires copying the string.
log_statement (enum)
Controls which SQL statements are logged. Valid values are none (off), ddl, mod, and all (all statements). ddl logs all data definition statements, such as CREATE, ALTER, and DROP statements. mod logs all ddl statements, plus data-modifying statements such as INSERT, UPDATE, DELETE, TRUNCATE, and COPY FROM. PREPARE, EXECUTE, and EXPLAIN ANALYZE statements are also logged if their contained command is of an appropriate type. For clients using extended query protocol, logging occurs when an Execute message is received, and values of the Bind parameters are included (with any embedded single-quote marks doubled).
The default is none. Only superusers can change this setting.
Statements that contain simple syntax errors are not logged even by the log_statement = all setting, because the log message is emitted only after basic parsing has been done to determine the statement type. In the case of extended query protocol, this setting likewise does not log statements that fail before the Execute phase (i.e., during parse analysis or planning). Set log_min_error_statement to ERROR (or lower) to log such statements.
Logged statements might reveal sensitive data and even contain plaintext passwords.
log_replication_commands (boolean)
Causes each replication command to be logged in the server log. See Section 53.4 for more information about replication commands. The default value is off.
Only superusers can change this setting.
log_temp_files (integer)
Controls logging of temporary file names and sizes. Temporary files can be created for sorts, hashes, and temporary query results. If enabled by this setting, a log entry is emitted for each temporary file when it is deleted. A value of zero logs all temporary file information, while positive values log only files whose size is greater than or equal to the specified amount of data. If this value is specified without units, it is taken as kilobytes. The default setting is -1, which disables such logging. Only superusers can change this setting.
log_timezone (string)
Sets the time zone used for timestamps written in the server log. Unlike TimeZone, this value is cluster-wide, so that all sessions will report timestamps consistently. The built-in default is GMT, but that is typically overridden in postgresql.conf; initdb will install a setting there corresponding to its system environment.
See Section 8.5.3 for more information.
This parameter can only be set in the postgresql.conf
file or on the server command line.
Including csvlog in the log_destination list provides a convenient way to import log files into a database table. This option emits log lines in comma-separated-values (CSV) format, with these columns:
time stamp with milliseconds,
user name,
database name,
process ID,
client host:port number,
session ID,
per-session line number,
command tag,
session start time,
virtual transaction ID,
regular transaction ID,
error severity,
SQLSTATE code,
error message,
error message detail,
hint,
internal query that led to the error (if any),
character count of the error position therein,
error context,
user query that led to the error (if any and enabled by log_min_error_statement),
character count of the error position therein,
location of the error in the PostgreSQL source code (if log_error_verbosity is set to verbose),
application name, backend type, process ID of parallel group leader,
and query id.
Here is a sample table definition for storing CSV-format log output:
CREATE TABLE postgres_log
(
  log_time timestamp(3) with time zone,
  user_name text,
  database_name text,
  process_id integer,
  connection_from text,
  session_id text,
  session_line_num bigint,
  command_tag text,
  session_start_time timestamp with time zone,
  virtual_transaction_id text,
  transaction_id bigint,
  error_severity text,
  sql_state_code text,
  message text,
  detail text,
  hint text,
  internal_query text,
  internal_query_pos integer,
  context text,
  query text,
  query_pos integer,
  location text,
  application_name text,
  backend_type text,
  leader_pid integer,
  query_id bigint,
  PRIMARY KEY (session_id, session_line_num)
);
To import a log file into this table, use the COPY FROM
command:
COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
It is also possible to access the file as a foreign table, using the supplied file_fdw module.
There are a few things you need to do to simplify importing CSV log files:
Set log_filename and log_rotation_age to provide a consistent, predictable naming scheme for your log files. This lets you predict what the file name will be and know when an individual log file is complete and therefore ready to be imported.
Set log_rotation_size to 0 to disable size-based log rotation, as it makes the log file name difficult to predict.
Set log_truncate_on_rotation to on so that old log data isn't mixed with the new in the same file.
The table definition above includes a primary key specification. This is useful to protect against accidentally importing the same information twice. The COPY command commits all of the data it imports at one time, so any error will cause the entire import to fail. If you import a partial log file and later import the file again when it is complete, the primary key violation will cause the import to fail. Wait until the log is complete and closed before importing. This procedure will also protect against accidentally importing a partial line that hasn't been completely written, which would also cause COPY to fail.
These settings control how process titles of server processes are modified. Process titles are typically viewed using programs like ps or, on Windows, Process Explorer. See Section 28.1 for details.
cluster_name (string)
Sets a name that identifies this database cluster (instance) for various purposes. The cluster name appears in the process title for all server processes in this cluster. Moreover, it is the default application name for a standby connection (see synchronous_standby_names.)
The name can be any string of less than NAMEDATALEN characters (64 characters in a standard build). Only printable ASCII characters may be used in the cluster_name value. Other characters will be replaced with question marks (?). No name is shown if this parameter is set to the empty string '' (which is the default). This parameter can only be set at server start.
update_process_title (boolean)
Enables updating of the process title every time a new SQL command is received by the server. This setting defaults to on on most platforms, but it defaults to off on Windows due to that platform's larger overhead for updating the process title. Only superusers can change this setting.
These parameters control server-wide statistics collection features. When statistics collection is enabled, the data that is produced can be accessed via the pg_stat and pg_statio family of system views. Refer to Chapter 28 for more information.
track_activities (boolean)
Enables the collection of information on the currently executing command of each session, along with its identifier and the time when that command began execution. This parameter is on by default. Note that even when enabled, this information is not visible to all users, only to superusers, roles with privileges of the pg_read_all_stats role, and the user owning the sessions being reported on (including sessions belonging to a role they have the privileges of), so it should not represent a security risk. Only superusers can change this setting.
track_activity_query_size (integer)
Specifies the amount of memory reserved to store the text of the currently executing command for each active session, for the pg_stat_activity.query field. If this value is specified without units, it is taken as bytes. The default value is 1024 bytes. This parameter can only be set at server start.
track_counts (boolean)
Enables collection of statistics on database activity. This parameter is on by default, because the autovacuum daemon needs the collected information. Only superusers can change this setting.
track_io_timing (boolean)
Enables timing of database I/O calls. This parameter is off by default, as it will repeatedly query the operating system for the current time, which may cause significant overhead on some platforms. You can use the pg_test_timing tool to measure the overhead of timing on your system. I/O timing information is displayed in pg_stat_database, in the output of EXPLAIN when the BUFFERS option is used, by autovacuum for auto-vacuums and auto-analyzes when log_autovacuum_min_duration is set, and by pg_stat_statements. Only superusers can change this setting.
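For illustration only (not part of the original text), a superuser session might observe the effect of this setting roughly as follows; the view and column names reflect recent PostgreSQL releases:
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM pg_class;
-- When reads or writes occur, the plan output includes "I/O Timings" lines.
SELECT blk_read_time, blk_write_time
  FROM pg_stat_database WHERE datname = current_database();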
track_wal_io_timing (boolean)
Enables timing of WAL I/O calls. This parameter is off by default, as it will repeatedly query the operating system for the current time, which may cause significant overhead on some platforms. You can use the pg_test_timing tool to measure the overhead of timing on your system. I/O timing information is displayed in pg_stat_wal. Only superusers can change this setting.
track_functions (enum)
Enables tracking of function call counts and time used. Specify pl to track only procedural-language functions, all to also track SQL and C language functions. The default is none, which disables function statistics tracking. Only superusers can change this setting.
SQL-language functions that are simple enough to be “inlined” into the calling query will not be tracked, regardless of this setting.
stats_temp_directory (string)
Sets the directory to store temporary statistics data in. This can be a path relative to the data directory or an absolute path. The default is pg_stat_tmp. Pointing this at a RAM-based file system will decrease physical I/O requirements and can lead to improved performance. This parameter can only be set in the postgresql.conf file or on the server command line.
compute_query_id (enum)
Enables in-core computation of a query identifier. Query identifiers can be displayed in the pg_stat_activity view, using EXPLAIN, or emitted in the log if configured via the log_line_prefix parameter. The pg_stat_statements extension also requires a query identifier to be computed. Note that an external module can alternatively be used if the in-core query identifier computation method is not acceptable. In this case, in-core computation must always be disabled.
Valid values are off (always disabled), on (always enabled), auto, which lets modules such as pg_stat_statements automatically enable it, and regress, which has the same effect as auto except that the query identifier is not shown in the EXPLAIN output in order to facilitate automated regression testing. The default is auto.
To ensure that only one query identifier is calculated and displayed, extensions that calculate query identifiers should throw an error if a query identifier has already been computed.
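As a small illustration (not part of the original text; names reflect recent PostgreSQL releases), the computed identifier can be inspected like this:
SET compute_query_id = on;
EXPLAIN (VERBOSE) SELECT 1;
-- With VERBOSE, the plan output ends with a "Query Identifier:" line.
SELECT query, query_id FROM pg_stat_activity WHERE pid = pg_backend_pid();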
log_statement_stats (boolean)
log_parser_stats (boolean)
log_planner_stats (boolean)
log_executor_stats (boolean)
For each query, output performance statistics of the respective module to the server log. This is a crude profiling instrument, similar to the Unix getrusage() operating system facility. log_statement_stats reports total statement statistics, while the others report per-module statistics. log_statement_stats cannot be enabled together with any of the per-module options. All of these options are disabled by default. Only superusers can change these settings.
These settings control the behavior of the autovacuum feature. Refer to Section 25.1.6 for more information. Note that many of these settings can be overridden on a per-table basis; see Storage Parameters.
autovacuum (boolean)
Controls whether the server should run the autovacuum launcher daemon. This is on by default; however, track_counts must also be enabled for autovacuum to work. This parameter can only be set in the postgresql.conf file or on the server command line; however, autovacuuming can be disabled for individual tables by changing table storage parameters.
Note that even when this parameter is disabled, the system will launch autovacuum processes if necessary to prevent transaction ID wraparound. See Section 25.1.5 for more information.
autovacuum_max_workers (integer)
Specifies the maximum number of autovacuum processes (other than the autovacuum launcher) that may be running at any one time. The default is three. This parameter can only be set at server start.
autovacuum_naptime (integer)
Specifies the minimum delay between autovacuum runs on any given database. In each round the daemon examines the database and issues VACUUM and ANALYZE commands as needed for tables in that database. If this value is specified without units, it is taken as seconds. The default is one minute (1min). This parameter can only be set in the postgresql.conf file or on the server command line.
autovacuum_vacuum_threshold (integer)
Specifies the minimum number of updated or deleted tuples needed to trigger a VACUUM in any one table. The default is 50 tuples. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
autovacuum_vacuum_insert_threshold (integer)
Specifies the number of inserted tuples needed to trigger a VACUUM in any one table. The default is 1000 tuples. If -1 is specified, autovacuum will not trigger a VACUUM operation on any tables based on the number of inserts. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
autovacuum_analyze_threshold (integer)
Specifies the minimum number of inserted, updated or deleted tuples needed to trigger an ANALYZE in any one table. The default is 50 tuples. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
autovacuum_vacuum_scale_factor (floating point)
Specifies a fraction of the table size to add to autovacuum_vacuum_threshold when deciding whether to trigger a VACUUM. The default is 0.2 (20% of table size). This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
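As a worked illustration (not part of the original text), the trigger condition described in Section 25.1.6 combines the two settings above: vacuum threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples. With the defaults, a table of 1,000,000 rows is vacuumed once roughly 50 + 0.2 * 1,000,000 = 200,050 tuples are obsolete. A hypothetical per-table override to vacuum a large table sooner might look like:
ALTER TABLE big_table SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold = 1000
);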
autovacuum_vacuum_insert_scale_factor (floating point)
Specifies a fraction of the table size to add to autovacuum_vacuum_insert_threshold when deciding whether to trigger a VACUUM. The default is 0.2 (20% of table size). This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
autovacuum_analyze_scale_factor (floating point)
Specifies a fraction of the table size to add to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE. The default is 0.1 (10% of table size). This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
autovacuum_freeze_max_age (integer)
Specifies the maximum age (in transactions) that a table's pg_class.relfrozenxid field can attain before a VACUUM operation is forced to prevent transaction ID wraparound within the table. Note that the system will launch autovacuum processes to prevent wraparound even when autovacuum is otherwise disabled.
Vacuum also allows removal of old files from the pg_xact subdirectory, which is why the default is a relatively low 200 million transactions. This parameter can only be set at server start, but the setting can be reduced for individual tables by changing table storage parameters. For more information see Section 25.1.5.
autovacuum_multixact_freeze_max_age (integer)
Specifies the maximum age (in multixacts) that a table's pg_class.relminmxid field can attain before a VACUUM operation is forced to prevent multixact ID wraparound within the table. Note that the system will launch autovacuum processes to prevent wraparound even when autovacuum is otherwise disabled.
Vacuuming multixacts also allows removal of old files from the pg_multixact/members and pg_multixact/offsets subdirectories, which is why the default is a relatively low 400 million multixacts. This parameter can only be set at server start, but the setting can be reduced for individual tables by changing table storage parameters. For more information see Section 25.1.5.1.
autovacuum_vacuum_cost_delay (floating point)
Specifies the cost delay value that will be used in automatic VACUUM operations. If -1 is specified, the regular vacuum_cost_delay value will be used. If this value is specified without units, it is taken as milliseconds. The default value is 2 milliseconds. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
autovacuum_vacuum_cost_limit (integer)
Specifies the cost limit value that will be used in automatic VACUUM operations. If -1 is specified (which is the default), the regular vacuum_cost_limit value will be used. Note that the value is distributed proportionally among the running autovacuum workers, if there is more than one, so that the sum of the limits for each worker does not exceed the value of this variable. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
client_min_messages (enum)
Controls which message levels are sent to the client. Valid values are DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1, LOG, NOTICE, WARNING, and ERROR. Each level includes all the levels that follow it. The later the level, the fewer messages are sent. The default is NOTICE. Note that LOG has a different rank here than in log_min_messages.
INFO level messages are always sent to the client.
search_path (string)
This variable specifies the order in which schemas are searched when an object (table, data type, function, etc.) is referenced by a simple name with no schema specified. When there are objects of identical names in different schemas, the one found first in the search path is used. An object that is not in any of the schemas in the search path can only be referenced by specifying its containing schema with a qualified (dotted) name.
The value for search_path must be a comma-separated list of schema names. Any name that is not an existing schema, or is a schema for which the user does not have USAGE permission, is silently ignored.
If one of the list items is the special name $user, then the schema having the name returned by CURRENT_USER is substituted, if there is such a schema and the user has USAGE permission for it. (If not, $user is ignored.)
The system catalog schema, pg_catalog, is always searched, whether it is mentioned in the path or not. If it is mentioned in the path then it will be searched in the specified order. If pg_catalog is not in the path then it will be searched before searching any of the path items.
Likewise, the current session's temporary-table schema, pg_temp_nnn, is always searched if it exists. It can be explicitly listed in the path by using the alias pg_temp. If it is not listed in the path then it is searched first (even before pg_catalog). However, the temporary schema is only searched for relation (table, view, sequence, etc.) and data type names. It is never searched for function or operator names.
When objects are created without specifying a particular target schema, they will be placed in the first valid schema named in search_path. An error is reported if the search path is empty.
The default value for this parameter is "$user", public.
This setting supports shared use of a database (where no users have private schemas, and all share use of public), private per-user schemas, and combinations of these. Other effects can be obtained by altering the default search path setting, either globally or per-user.
For more information on schema handling, see Section 5.9. In particular, the default configuration is suitable only when the database has a single user or a few mutually-trusting users.
The current effective value of the search path can be examined via the SQL function current_schemas (see Section 9.26). This is not quite the same as examining the value of search_path, since current_schemas shows how the items appearing in search_path were resolved.
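For illustration only (not part of the original text; the schema name myschema is hypothetical):
SET search_path TO myschema, public;
SHOW search_path;
SELECT current_schemas(true);   -- also lists implicitly-searched schemas such as pg_catalog
CREATE TABLE mytable (id int);  -- created in myschema, the first valid schema in the path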
row_security (boolean)
This variable controls whether to raise an error in lieu of applying a row security policy. When set to on, policies apply normally. When set to off, queries fail which would otherwise apply at least one policy. The default is on.
Change to off where limited row visibility could cause incorrect results; for example, pg_dump makes that change by default. This variable has no effect on roles which bypass every row security policy, to wit, superusers and roles with the BYPASSRLS attribute.
For more information on row security policies, see CREATE POLICY.
default_table_access_method (string)
This parameter specifies the default table access method to use when creating tables or materialized views if the CREATE command does not explicitly specify an access method, or when SELECT ... INTO is used, which does not allow specifying a table access method. The default is heap.
default_tablespace (string)
This variable specifies the default tablespace in which to create objects (tables and indexes) when a CREATE command does not explicitly specify a tablespace.
The value is either the name of a tablespace, or an empty string to specify using the default tablespace of the current database. If the value does not match the name of any existing tablespace, PostgreSQL will automatically use the default tablespace of the current database. If a nondefault tablespace is specified, the user must have CREATE privilege for it, or creation attempts will fail.
This variable is not used for temporary tables; for them, temp_tablespaces is consulted instead.
This variable is also not used when creating databases. By default, a new database inherits its tablespace setting from the template database it is copied from.
If this parameter is set to a value other than the empty string when a partitioned table is created, the partitioned table's tablespace will be set to that value, which will be used as the default tablespace for partitions created in the future, even if default_tablespace has changed since then.
For more information on tablespaces, see Section 23.6.
default_toast_compression (enum)
This variable sets the default TOAST compression method for values of compressible columns. (This can be overridden for individual columns by setting the COMPRESSION column option in CREATE TABLE or ALTER TABLE.) The supported compression methods are pglz and (if PostgreSQL was compiled with --with-lz4) lz4. The default is pglz.
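A brief sketch (not part of the original text; lz4 is only usable if the server was built with --with-lz4, and the table name is hypothetical):
SET default_toast_compression = 'lz4';
CREATE TABLE t1 (payload text);                            -- payload defaults to lz4 compression
ALTER TABLE t1 ALTER COLUMN payload SET COMPRESSION pglz;  -- per-column override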
temp_tablespaces (string)
This variable specifies tablespaces in which to create temporary objects (temp tables and indexes on temp tables) when a CREATE command does not explicitly specify a tablespace. Temporary files for purposes such as sorting large data sets are also created in these tablespaces.
The value is a list of names of tablespaces. When there is more than one name in the list, PostgreSQL chooses a random member of the list each time a temporary object is to be created; except that within a transaction, successively created temporary objects are placed in successive tablespaces from the list. If the selected element of the list is an empty string, PostgreSQL will automatically use the default tablespace of the current database instead.
When temp_tablespaces is set interactively, specifying a nonexistent tablespace is an error, as is specifying a tablespace for which the user does not have CREATE privilege. However, when using a previously set value, nonexistent tablespaces are ignored, as are tablespaces for which the user lacks CREATE privilege. In particular, this rule applies when using a value set in postgresql.conf.
The default value is an empty string, which results in all temporary objects being created in the default tablespace of the current database.
See also default_tablespace.
check_function_bodies (boolean)
This parameter is normally on. When set to off, it disables validation of the routine body string during CREATE FUNCTION and CREATE PROCEDURE. Disabling validation avoids side effects of the validation process, in particular preventing false positives due to problems such as forward references. Set this parameter to off before loading functions on behalf of other users; pg_dump does so automatically.
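A minimal sketch of the forward-reference case (not part of the original text; helper_not_defined_yet is a hypothetical function that does not exist yet):
SET check_function_bodies = off;
CREATE FUNCTION call_helper() RETURNS int
    LANGUAGE sql AS 'SELECT helper_not_defined_yet()';
-- With validation off, the missing helper is not detected until call_helper() is executed.
SET check_function_bodies = on;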
default_transaction_isolation (enum)
Each SQL transaction has an isolation level, which can be either “read uncommitted”, “read committed”, “repeatable read”, or “serializable”. This parameter controls the default isolation level of each new transaction. The default is “read committed”.
Consult Chapter 13 and SET TRANSACTION for more information.
default_transaction_read_only (boolean)
A read-only SQL transaction cannot alter non-temporary tables. This parameter controls the default read-only status of each new transaction. The default is off (read/write).
Consult SET TRANSACTION for more information.
default_transaction_deferrable (boolean)
When running at the serializable isolation level, a deferrable read-only SQL transaction may be delayed before it is allowed to proceed. However, once it begins executing it does not incur any of the overhead required to ensure serializability; so serialization code will have no reason to force it to abort because of concurrent updates, making this option suitable for long-running read-only transactions.
This parameter controls the default deferrable status of each new transaction. It currently has no effect on read-write transactions or those operating at isolation levels lower than serializable. The default is off.
Consult SET TRANSACTION for more information.
transaction_isolation (enum)
This parameter reflects the current transaction's isolation level. At the beginning of each transaction, it is set to the current value of default_transaction_isolation. Any subsequent attempt to change it is equivalent to a SET TRANSACTION command.
transaction_read_only (boolean)
This parameter reflects the current transaction's read-only status. At the beginning of each transaction, it is set to the current value of default_transaction_read_only. Any subsequent attempt to change it is equivalent to a SET TRANSACTION command.
transaction_deferrable (boolean)
This parameter reflects the current transaction's deferrability status. At the beginning of each transaction, it is set to the current value of default_transaction_deferrable. Any subsequent attempt to change it is equivalent to a SET TRANSACTION command.
session_replication_role (enum)
Controls firing of replication-related triggers and rules for the current session. Setting this variable requires superuser privilege and results in discarding any previously cached query plans. Possible values are origin (the default), replica and local.
The intended use of this setting is that logical replication systems set it to replica when they are applying replicated changes. The effect of that will be that triggers and rules (that have not been altered from their default configuration) will not fire on the replica. See the ALTER TABLE clauses ENABLE TRIGGER and ENABLE RULE for more information.
PostgreSQL treats the settings origin and local the same internally. Third-party replication systems may use these two values for their internal purposes, for example using local to designate a session whose changes should not be replicated.
Since foreign keys are implemented as triggers, setting this parameter to replica also disables all foreign key checks, which can leave data in an inconsistent state if improperly used.
statement_timeout (integer)
Abort any statement that takes more than the specified amount of time. If log_min_error_statement is set to ERROR or lower, the statement that timed out will also be logged. If this value is specified without units, it is taken as milliseconds. A value of zero (the default) disables the timeout.
The timeout is measured from the time a command arrives at the server until it is completed by the server. If multiple SQL statements appear in a single simple-Query message, the timeout is applied to each statement separately. (PostgreSQL versions before 13 usually treated the timeout as applying to the whole query string.) In extended query protocol, the timeout starts running when any query-related message (Parse, Bind, Execute, Describe) arrives, and it is canceled by completion of an Execute or Sync message.
Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions.
lock_timeout (integer)
Abort any statement that waits longer than the specified amount of time while attempting to acquire a lock on a table, index, row, or other database object. The time limit applies separately to each lock acquisition attempt. The limit applies both to explicit locking requests (such as LOCK TABLE, or SELECT FOR UPDATE without NOWAIT) and to implicitly-acquired locks. If this value is specified without units, it is taken as milliseconds. A value of zero (the default) disables the timeout.
Unlike statement_timeout, this timeout can only occur while waiting for locks. Note that if statement_timeout is nonzero, it is rather pointless to set lock_timeout to the same or larger value, since the statement timeout would always trigger first. If log_min_error_statement is set to ERROR or lower, the statement that timed out will be logged.
Setting lock_timeout in postgresql.conf is not recommended because it would affect all sessions.
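For illustration only (not part of the original text; reporting_user is a hypothetical role), these timeouts are typically set per session or per role rather than cluster-wide:
SET statement_timeout = '5s';   -- cancel statements running longer than five seconds
SET lock_timeout = '2s';        -- give up after two seconds of waiting for any single lock
ALTER ROLE reporting_user SET statement_timeout = '30s';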
idle_in_transaction_session_timeout (integer)
Terminate any session that has been idle (that is, waiting for a client query) within an open transaction for longer than the specified amount of time. If this value is specified without units, it is taken as milliseconds. A value of zero (the default) disables the timeout.
This option can be used to ensure that idle sessions do not hold locks for an unreasonable amount of time. Even when no significant locks are held, an open transaction prevents vacuuming away recently-dead tuples that may be visible only to this transaction; so remaining idle for a long time can contribute to table bloat. See Section 25.1 for more details.
idle_session_timeout (integer)
Terminate any session that has been idle (that is, waiting for a client query), but not within an open transaction, for longer than the specified amount of time. If this value is specified without units, it is taken as milliseconds. A value of zero (the default) disables the timeout.
Unlike the case with an open transaction, an idle session without a transaction imposes no large costs on the server, so there is less need to enable this timeout than idle_in_transaction_session_timeout.
Be wary of enforcing this timeout on connections made through connection-pooling software or other middleware, as such a layer may not react well to unexpected connection closure. It may be helpful to enable this timeout only for interactive sessions, perhaps by applying it only to particular users.
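A sketch of the per-user approach suggested above (not part of the original text; the role names are hypothetical):
ALTER ROLE interactive_user SET idle_session_timeout = '10min';
ALTER ROLE pooled_app SET idle_session_timeout = 0;   -- leave pooled connections alone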
vacuum_freeze_table_age (integer)
VACUUM performs an aggressive scan if the table's pg_class.relfrozenxid field has reached the age specified by this setting. An aggressive scan differs from a regular VACUUM in that it visits every page that might contain unfrozen XIDs or MXIDs, not just those that might contain dead tuples. The default is 150 million transactions. Although users can set this value anywhere from zero to two billion, VACUUM will silently limit the effective value to 95% of autovacuum_freeze_max_age, so that a periodic manual VACUUM has a chance to run before an anti-wraparound autovacuum is launched for the table. For more information see Section 25.1.5.
vacuum_freeze_min_age (integer)
Specifies the cutoff age (in transactions) that VACUUM should use to decide whether to freeze row versions while scanning a table. The default is 50 million transactions. Although users can set this value anywhere from zero to one billion, VACUUM will silently limit the effective value to half the value of autovacuum_freeze_max_age, so that there is not an unreasonably short time between forced autovacuums. For more information see Section 25.1.5.
vacuum_failsafe_age (integer)
Specifies the maximum age (in transactions) that a table's pg_class.relfrozenxid field can attain before VACUUM takes extraordinary measures to avoid system-wide transaction ID wraparound failure. This is VACUUM's strategy of last resort. The failsafe typically triggers when an autovacuum to prevent transaction ID wraparound has already been running for some time, though it's possible for the failsafe to trigger during any VACUUM.
When the failsafe is triggered, any cost-based delay that is in effect will no longer be applied, and further non-essential maintenance tasks (such as index vacuuming) are bypassed.
The default is 1.6 billion transactions. Although users can set this value anywhere from zero to 2.1 billion, VACUUM will silently adjust the effective value to no less than 105% of autovacuum_freeze_max_age.
vacuum_multixact_freeze_table_age (integer)
VACUUM performs an aggressive scan if the table's pg_class.relminmxid field has reached the age specified by this setting. An aggressive scan differs from a regular VACUUM in that it visits every page that might contain unfrozen XIDs or MXIDs, not just those that might contain dead tuples. The default is 150 million multixacts. Although users can set this value anywhere from zero to two billion, VACUUM will silently limit the effective value to 95% of autovacuum_multixact_freeze_max_age, so that a periodic manual VACUUM has a chance to run before an anti-wraparound is launched for the table. For more information see Section 25.1.5.1.
vacuum_multixact_freeze_min_age (integer)
Specifies the cutoff age (in multixacts) that VACUUM should use to decide whether to replace multixact IDs with a newer transaction ID or multixact ID while scanning a table. The default is 5 million multixacts. Although users can set this value anywhere from zero to one billion, VACUUM will silently limit the effective value to half the value of autovacuum_multixact_freeze_max_age, so that there is not an unreasonably short time between forced autovacuums. For more information see Section 25.1.5.1.
vacuum_multixact_failsafe_age (integer)
Specifies the maximum age (in multixacts) that a table's pg_class.relminmxid field can attain before VACUUM takes extraordinary measures to avoid system-wide multixact ID wraparound failure. This is VACUUM's strategy of last resort. The failsafe typically triggers when an autovacuum to prevent transaction ID wraparound has already been running for some time, though it's possible for the failsafe to trigger during any VACUUM.
When the failsafe is triggered, any cost-based delay that is in effect will no longer be applied, and further non-essential maintenance tasks (such as index vacuuming) are bypassed.
The default is 1.6 billion multixacts. Although users can set this value anywhere from zero to 2.1 billion, VACUUM will silently adjust the effective value to no less than 105% of autovacuum_multixact_freeze_max_age.
bytea_output (enum)
Sets the output format for values of type bytea. Valid values are hex (the default) and escape (the traditional PostgreSQL format). See Section 8.4 for more information. The bytea type always accepts both formats on input, regardless of this setting.
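For illustration only (not part of the original text):
SET bytea_output = 'hex';
SELECT '\xdeadbeef'::bytea;   -- displayed as \xdeadbeef
SET bytea_output = 'escape';
SELECT '\xdeadbeef'::bytea;   -- displayed in the traditional escaped-octet form, roughly \336\255\276\357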
xmlbinary (enum)
Sets how binary values are to be encoded in XML. This applies for example when bytea values are converted to XML by the functions xmlelement or xmlforest. Possible values are base64 and hex, which are both defined in the XML Schema standard. The default is base64. For further information about XML-related functions, see Section 9.15.
The actual choice here is mostly a matter of taste, constrained only by possible restrictions in client applications. Both methods support all possible values, although the hex encoding will be somewhat larger than the base64 encoding.
xmloption (enum)
Sets whether DOCUMENT or CONTENT is implicit when converting between XML and character string values. See Section 8.13 for a description of this. Valid values are DOCUMENT and CONTENT. The default is CONTENT.
According to the SQL standard, the command to set this option is
SET XML OPTION { DOCUMENT | CONTENT };
This syntax is also available in PostgreSQL.
gin_pending_list_limit (integer)
Sets the maximum size of a GIN index's pending list, which is used when fastupdate is enabled. If the list grows larger than this maximum size, it is cleaned up by moving the entries in it to the index's main GIN data structure in bulk. If this value is specified without units, it is taken as kilobytes. The default is four megabytes (4MB). This setting can be overridden for individual GIN indexes by changing index storage parameters. See Section 67.4.1 and Section 67.5 for more information.
restrict_nonsystem_relation_kind (string)
This variable specifies the relation kinds to which access is restricted. It contains a comma-separated list of relation kinds. Currently, the supported relation kinds are view and foreign-table.
DateStyle (string)
Sets the display format for date and time values, as well as the rules for interpreting ambiguous date input values. For historical reasons, this variable contains two independent components: the output format specification (ISO, Postgres, SQL, or German) and the input/output specification for year/month/day ordering (DMY, MDY, or YMD). These can be set separately or together. The keywords Euro and European are synonyms for DMY; the keywords US, NonEuro, and NonEuropean are synonyms for MDY. See Section 8.5 for more information. The built-in default is ISO, MDY, but initdb will initialize the configuration file with a setting that corresponds to the behavior of the chosen lc_time locale.
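For illustration only (not part of the original text), the ordering component changes how ambiguous input is interpreted:
SET DateStyle = 'ISO, DMY';
SELECT '07/08/2024'::date;   -- interpreted as 7 August 2024 (day/month/year)
SET DateStyle = 'ISO, MDY';
SELECT '07/08/2024'::date;   -- interpreted as July 8, 2024 (month/day/year)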
IntervalStyle (enum)
Sets the display format for interval values. The value sql_standard will produce output matching SQL standard interval literals. The value postgres (which is the default) will produce output matching PostgreSQL releases prior to 8.4 when the DateStyle parameter was set to ISO. The value postgres_verbose will produce output matching PostgreSQL releases prior to 8.4 when the DateStyle parameter was set to non-ISO output. The value iso_8601 will produce output matching the time interval “format with designators” defined in section 4.4.3.2 of ISO 8601.
The IntervalStyle parameter also affects the interpretation of ambiguous interval input. See Section 8.5.4 for more information.
TimeZone (string)
Sets the time zone for displaying and interpreting time stamps. The built-in default is GMT, but that is typically overridden in postgresql.conf; initdb will install a setting there corresponding to its system environment. See Section 8.5.3 for more information.
timezone_abbreviations (string)
Sets the collection of time zone abbreviations that will be accepted by the server for datetime input. The default is 'Default', which is a collection that works in most of the world; there are also 'Australia' and 'India', and other collections can be defined for a particular installation. See Section B.4 for more information.
extra_float_digits (integer)
This parameter adjusts the number of digits used for textual output of floating-point values, including float4, float8, and geometric data types.
If the value is 1 (the default) or above, float values are output in shortest-precise format; see Section 8.1.3. The actual number of digits generated depends only on the value being output, not on the value of this parameter. At most 17 digits are required for float8 values, and 9 for float4 values. This format is both fast and precise, preserving the original binary float value exactly when correctly read. For historical compatibility, values up to 3 are permitted.
If the value is zero or negative, then the output is rounded to a given decimal precision. The precision used is the standard number of digits for the type (FLT_DIG or DBL_DIG as appropriate) reduced according to the value of this parameter. (For example, specifying -1 will cause float4 values to be output rounded to 5 significant digits, and float8 values rounded to 14 digits.) This format is slower and does not preserve all the bits of the binary float value, but may be more human-readable.
The meaning of this parameter, and its default value, changed in PostgreSQL 12; see Section 8.1.3 for further discussion.
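A brief illustration (not part of the original text; the exact digits shown are indicative only):
SET extra_float_digits = 1;    -- the default: shortest-precise output
SELECT 0.1::float8;            -- 0.1
SET extra_float_digits = -3;   -- round to DBL_DIG - 3 = 12 significant digits
SELECT (1.0/3.0)::float8;      -- e.g. 0.333333333333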
client_encoding (string)
Sets the client-side encoding (character set). The default is to use the database encoding. The character sets supported by the PostgreSQL server are described in Section 24.3.1.
lc_messages (string)
Sets the language in which messages are displayed. Acceptable values are system-dependent; see Section 24.1 for more information. If this variable is set to the empty string (which is the default) then the value is inherited from the execution environment of the server in a system-dependent way.
On some systems, this locale category does not exist. Setting this variable will still work, but there will be no effect. Also, there is a chance that no translated messages for the desired language exist. In that case you will continue to see the English messages.
Only superusers can change this setting, because it affects the messages sent to the server log as well as to the client, and an improper value might obscure the readability of the server logs.
lc_monetary (string)
Sets the locale to use for formatting monetary amounts, for example with the to_char family of functions. Acceptable values are system-dependent; see Section 24.1 for more information. If this variable is set to the empty string (which is the default) then the value is inherited from the execution environment of the server in a system-dependent way.
lc_numeric (string)
Sets the locale to use for formatting numbers, for example with the to_char family of functions. Acceptable values are system-dependent; see Section 24.1 for more information. If this variable is set to the empty string (which is the default) then the value is inherited from the execution environment of the server in a system-dependent way.
lc_time (string)
Sets the locale to use for formatting dates and times, for example with the to_char family of functions. Acceptable values are system-dependent; see Section 24.1 for more information. If this variable is set to the empty string (which is the default) then the value is inherited from the execution environment of the server in a system-dependent way.
default_text_search_config (string)
Selects the text search configuration that is used by those variants of the text search functions that do not have an explicit argument specifying the configuration. See Chapter 12 for further information. The built-in default is pg_catalog.simple, but initdb will initialize the configuration file with a setting that corresponds to the chosen lc_ctype locale, if a configuration matching that locale can be identified.
Several settings are available for preloading shared libraries into the server, in order to load additional functionality or achieve performance benefits. For example, a setting of '$libdir/mylib' would cause mylib.so (or on some platforms, mylib.sl) to be preloaded from the installation's standard library directory. The differences between the settings are when they take effect and what privileges are required to change them.
PostgreSQL procedural language libraries can be preloaded in this way, typically by using the syntax '$libdir/plXXX' where XXX is pgsql, perl, tcl, or python.
Only shared libraries specifically intended to be used with PostgreSQL can be loaded this way. Every PostgreSQL-supported library has a “magic block” that is checked to guarantee compatibility. For this reason, non-PostgreSQL libraries cannot be loaded in this way. You might be able to use operating-system facilities such as LD_PRELOAD for that.
In general, refer to the documentation of a specific module for the recommended way to load that module.
local_preload_libraries (string)
This variable specifies one or more shared libraries that are to be preloaded at connection start. It contains a comma-separated list of library names, where each name is interpreted as for the LOAD command. Whitespace between entries is ignored; surround a library name with double quotes if you need to include whitespace or commas in the name. The parameter value only takes effect at the start of the connection. Subsequent changes have no effect. If a specified library is not found, the connection attempt will fail.
This option can be set by any user. Because of that, the libraries that can be loaded are restricted to those appearing in the plugins subdirectory of the installation's standard library directory. (It is the database administrator's responsibility to ensure that only “safe” libraries are installed there.) Entries in local_preload_libraries can specify this directory explicitly, for example $libdir/plugins/mylib, or just specify the library name — mylib would have the same effect as $libdir/plugins/mylib.
The intent of this feature is to allow unprivileged users to load debugging or performance-measurement libraries into specific sessions without requiring an explicit LOAD command. To that end, it would be typical to set this parameter using the PGOPTIONS environment variable on the client or by using ALTER ROLE SET.
However, unless a module is specifically designed to be used in this way by non-superusers, this is usually not the right setting to use. Look at session_preload_libraries instead.
session_preload_libraries (string)
This variable specifies one or more shared libraries that are to be preloaded at connection start. It contains a comma-separated list of library names, where each name is interpreted as for the LOAD command. Whitespace between entries is ignored; surround a library name with double quotes if you need to include whitespace or commas in the name. The parameter value only takes effect at the start of the connection. Subsequent changes have no effect. If a specified library is not found, the connection attempt will fail. Only superusers can change this setting.
The intent of this feature is to allow debugging or performance-measurement libraries to be loaded into specific sessions without an explicit LOAD command being given. For example, auto_explain could be enabled for all sessions under a given user name by setting this parameter with ALTER ROLE SET. Also, this parameter can be changed without restarting the server (but changes only take effect when a new session is started), so it is easier to add new modules this way, even if they should apply to all sessions.
Unlike shared_preload_libraries, there is no large performance advantage to loading a library at session start rather than when it is first used. There is some advantage, however, when connection pooling is used.
shared_preload_libraries (string)
This variable specifies one or more shared libraries to be preloaded at server start. It contains a comma-separated list of library names, where each name is interpreted as for the LOAD command. Whitespace between entries is ignored; surround a library name with double quotes if you need to include whitespace or commas in the name. This parameter can only be set at server start. If a specified library is not found, the server will fail to start.
Some libraries need to perform certain operations that can only take place at postmaster start, such as allocating shared memory, reserving light-weight locks, or starting background workers. Those libraries must be loaded at server start through this parameter. See the documentation of each library for details.
Other libraries can also be preloaded. By preloading a shared library, the library startup time is avoided when the library is first used. However, the time to start each new server process might increase slightly, even if that process never uses the library. So this parameter is recommended only for libraries that will be used in most sessions. Also, changing this parameter requires a server restart, so this is not the right setting to use for short-term debugging tasks, say. Use session_preload_libraries for that instead.
On Windows hosts, preloading a library at server start will not reduce the time required to start each new server process; each server process will re-load all preload libraries. However, shared_preload_libraries is still useful on Windows hosts for libraries that need to perform operations at postmaster start time.
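A typical postgresql.conf sketch (not part of the original text; the library names shown are common examples, not requirements):
shared_preload_libraries = 'pg_stat_statements, auto_explain'   # requires a server restart
session_preload_libraries = 'auto_explain'                      # affects new sessions only; superuser-settable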
jit_provider (string)
This variable is the name of the JIT provider library to be used (see Section 32.4.2). The default is llvmjit. This parameter can only be set at server start.
If set to a non-existent library, JIT will not be available, but no error will be raised. This allows JIT support to be installed separately from the main PostgreSQL package.
dynamic_library_path (string)
If a dynamically loadable module needs to be opened and the file name specified in the CREATE FUNCTION or LOAD command does not have a directory component (i.e., the name does not contain a slash), the system will search this path for the required file.
The value for dynamic_library_path must be a list of absolute directory paths separated by colons (or semi-colons on Windows). If a list element starts with the special string $libdir, the compiled-in PostgreSQL package library directory is substituted for $libdir; this is where the modules provided by the standard PostgreSQL distribution are installed. (Use pg_config --pkglibdir to find out the name of this directory.) For example:
dynamic_library_path = '/usr/local/lib/postgresql:/home/my_project/lib:$libdir'
or, in a Windows environment:
dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
The default value for this parameter is '$libdir'. If the value is set to an empty string, the automatic path search is turned off.
This parameter can be changed at run time by superusers, but a setting done that way will only persist until the end of the client connection, so this method should be reserved for development purposes. The recommended way to set this parameter is in the postgresql.conf configuration file.
gin_fuzzy_search_limit (integer)
Soft upper limit of the size of the set returned by GIN index scans. For more information see Section 67.5.
deadlock_timeout (integer)
This is the amount of time to wait on a lock before checking to see if there is a deadlock condition. The check for deadlock is relatively expensive, so the server doesn't run it every time it waits for a lock. We optimistically assume that deadlocks are not common in production applications and just wait on the lock for a while before checking for a deadlock. Increasing this value reduces the amount of time wasted in needless deadlock checks, but slows down reporting of real deadlock errors. If this value is specified without units, it is taken as milliseconds. The default is one second (1s), which is probably about the smallest value you would want in practice. On a heavily loaded server you might want to raise it. Ideally the setting should exceed your typical transaction time, so as to improve the odds that a lock will be released before the waiter decides to check for deadlock. Only superusers can change this setting.
When log_lock_waits is set, this parameter also determines the amount of time to wait before a log message is issued about the lock wait. If you are trying to investigate locking delays you might want to set a shorter than normal deadlock_timeout.
max_locks_per_transaction (integer)
The shared lock table tracks locks on max_locks_per_transaction * (max_connections + max_prepared_transactions) objects (e.g., tables); hence, no more than this many distinct objects can be locked at any one time. This parameter controls the average number of object locks allocated for each transaction; individual transactions can lock more objects as long as the locks of all transactions fit in the lock table. This is not the number of rows that can be locked; that value is unlimited. The default, 64, has historically proven sufficient, but you might need to raise this value if you have queries that touch many different tables in a single transaction, e.g., query of a parent table with many children. This parameter can only be set at server start.
When running a standby server, you must set this parameter to the same or higher value than on the primary server. Otherwise, queries will not be allowed in the standby server.
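As a worked example (not part of the original text): with the defaults of max_locks_per_transaction = 64, max_connections = 100, and max_prepared_transactions = 0, the shared lock table has room for 64 * (100 + 0) = 6,400 object locks across the whole cluster. A hypothetical postgresql.conf adjustment for workloads that touch thousands of partitions per transaction might be:
max_locks_per_transaction = 256    # requires a server restart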
max_pred_locks_per_transaction (integer)
The shared predicate lock table tracks locks on max_pred_locks_per_transaction * (max_connections + max_prepared_transactions) objects (e.g., tables); hence, no more than this many distinct objects can be locked at any one time. This parameter controls the average number of object locks allocated for each transaction; individual transactions can lock more objects as long as the locks of all transactions fit in the lock table. This is not the number of rows that can be locked; that value is unlimited. The default, 64, has generally been sufficient in testing, but you might need to raise this value if you have clients that touch many different tables in a single serializable transaction. This parameter can only be set at server start.
max_pred_locks_per_relation (integer)
This controls how many pages or tuples of a single relation can be predicate-locked before the lock is promoted to covering the whole relation. Values greater than or equal to zero mean an absolute limit, while negative values mean max_pred_locks_per_transaction divided by the absolute value of this setting. The default is -2, which keeps the behavior from previous versions of PostgreSQL. This parameter can only be set in the postgresql.conf file or on the server command line.
max_pred_locks_per_page (integer)
This controls how many rows on a single page can be predicate-locked before the lock is promoted to covering the whole page. The default is 2. This parameter can only be set in the postgresql.conf file or on the server command line.
array_nulls (boolean)
This controls whether the array input parser recognizes unquoted NULL as specifying a null array element. By default, this is on, allowing array values containing null values to be entered. However, PostgreSQL versions before 8.2 did not support null values in arrays, and therefore would treat NULL as specifying a normal array element with the string value “NULL”. For backward compatibility with applications that require the old behavior, this variable can be turned off.
Note that it is possible to create array values containing null values even when this variable is off.
backslash_quote (enum)
This controls whether a quote mark can be represented by \' in a string literal. The preferred, SQL-standard way to represent a quote mark is by doubling it ('') but PostgreSQL has historically also accepted \'. However, use of \' creates security risks because in some client character set encodings, there are multibyte characters in which the last byte is numerically equivalent to ASCII \. If client-side code does escaping incorrectly then an SQL-injection attack is possible. This risk can be prevented by making the server reject queries in which a quote mark appears to be escaped by a backslash.
The allowed values of backslash_quote are on (allow \' always), off (reject always), and safe_encoding (allow only if client encoding does not allow ASCII \ within a multibyte character). safe_encoding is the default setting.
Note that in a standard-conforming string literal, \ just means \ anyway. This parameter only affects the handling of non-standard-conforming literals, including escape string syntax (E'...').
escape_string_warning (boolean)
When on, a warning is issued if a backslash (\) appears in an ordinary string literal ('...' syntax) and standard_conforming_strings is off. The default is on.
Applications that wish to use backslash as escape should be modified to use escape string syntax (E'...'), because the default behavior of ordinary strings is now to treat backslash as an ordinary character, per SQL standard. This variable can be enabled to help locate code that needs to be changed.
lo_compat_privileges (boolean)
In PostgreSQL releases prior to 9.0, large objects did not have access privileges and were, therefore, always readable and writable by all users. Setting this variable to on disables the new privilege checks, for compatibility with prior releases. The default is off.
Only superusers can change this setting.
Setting this variable does not disable all security checks related to large objects — only those for which the default behavior has changed in PostgreSQL 9.0.
quote_all_identifiers (boolean)
When the database generates SQL, force all identifiers to be quoted, even if they are not (currently) keywords. This will affect the output of EXPLAIN as well as the results of functions like pg_get_viewdef. See also the --quote-all-identifiers option of pg_dump and pg_dumpall.
standard_conforming_strings (boolean)
This controls whether ordinary string literals ('...') treat backslashes literally, as specified in the SQL standard. Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off). Applications can check this parameter to determine how string literals will be processed. The presence of this parameter can also be taken as an indication that the escape string syntax (E'...') is supported. Escape string syntax (Section 4.1.2.2) should be used if an application desires backslashes to be treated as escape characters.
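For illustration only (not part of the original text):
SHOW standard_conforming_strings;   -- on by default since 9.1
SELECT 'a\nb';     -- backslash is an ordinary character: the four characters a \ n b
SELECT E'a\nb';    -- escape string syntax: "a", a newline, then "b"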
synchronize_seqscans (boolean)
This allows sequential scans of large tables to synchronize with each other, so that concurrent scans read the same block at about the same time and hence share the I/O workload. When this is enabled, a scan might start in the middle of the table and then “wrap around” the end to cover all rows, so as to synchronize with the activity of scans already in progress. This can result in unpredictable changes in the row ordering returned by queries that have no ORDER BY clause. Setting this parameter to off ensures the pre-8.3 behavior in which a sequential scan always starts from the beginning of the table. The default is on.
transform_null_equals (boolean)
When on, expressions of the form expr = NULL (or NULL = expr) are treated as expr IS NULL, that is, they return true if expr evaluates to the null value, and false otherwise. The correct SQL-spec-compliant behavior of expr = NULL is to always return null (unknown). Therefore this parameter defaults to off.
However, filtered forms in Microsoft Access generate queries that appear to use expr = NULL to test for null values, so if you use that interface to access the database you might want to turn this option on. Since expressions of the form expr = NULL always return the null value (using the SQL standard interpretation), they are not very useful and do not appear often in normal applications so this option does little harm in practice. But new users are frequently confused about the semantics of expressions involving null values, so this option is off by default.
Note that this option only affects the exact form = NULL, not other comparison operators or other expressions that are computationally equivalent to some expression involving the equals operator (such as IN). Thus, this option is not a general fix for bad programming.
Refer to Section 9.2 for related information.
exit_on_error (boolean)
If on, any error will terminate the current session. By default, this is set to off, so that only FATAL errors will terminate the session.
restart_after_crash (boolean)
When set to on, which is the default, PostgreSQL will automatically reinitialize after a backend crash. Leaving this value set to on is normally the best way to maximize the availability of the database. However, in some circumstances, such as when PostgreSQL is being invoked by clusterware, it may be useful to disable the restart so that the clusterware can gain control and take any actions it deems appropriate.
This parameter can only be set in the postgresql.conf file or on the server command line.
data_sync_retry (boolean)
When set to off, which is the default, PostgreSQL will raise a PANIC-level error on failure to flush modified data files to the file system. This causes the database server to crash. This parameter can only be set at server start.
On some operating systems, the status of data in the kernel's page cache is unknown after a write-back failure. In some cases it might have been entirely forgotten, making it unsafe to retry; the second attempt may be reported as successful, when in fact the data has been lost. In these circumstances, the only way to avoid data loss is to recover from the WAL after any failure is reported, preferably after investigating the root cause of the failure and replacing any faulty hardware.
If set to on, PostgreSQL will instead report an error but continue to run so that the data flushing operation can be retried in a later checkpoint. Only set it to on after investigating the operating system's treatment of buffered data in case of write-back failure.
recovery_init_sync_method
(enum
)
When set to fsync
, which is the default,
PostgreSQL will recursively open and
synchronize all files in the data directory before crash recovery
begins. The search for files will follow symbolic links for the WAL
directory and each configured tablespace (but not any other symbolic
links). This is intended to make sure that all WAL and data files are
durably stored on disk before replaying changes. This applies whenever
starting a database cluster that did not shut down cleanly, including
copies created with pg_basebackup.
On Linux, syncfs
may be used instead, to ask the
operating system to synchronize the whole file systems that contain the
data directory, the WAL files and each tablespace (but not any other
file systems that may be reachable through symbolic links). This may
be a lot faster than the fsync
setting, because it
doesn't need to open each file one by one. On the other hand, it may
be slower if a file system is shared by other applications that
modify a lot of files, since those files will also be written to disk.
Furthermore, on versions of Linux before 5.8, I/O errors encountered
while writing data to disk may not be reported to
PostgreSQL, and relevant error messages may
appear only in kernel logs.
This parameter can only be set in the
postgresql.conf
file or on the server command line.
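For example, a Linux server with a very large data directory might set this in postgresql.conf (a sketch, not a general recommendation):

recovery_init_sync_method = syncfs    # Linux only; the default is fsync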
The following “parameters” are read-only.
As such, they have been excluded from the sample
postgresql.conf
file. These options report
various aspects of PostgreSQL behavior
that might be of interest to certain applications, particularly
administrative front-ends.
Most of them are determined when PostgreSQL
is compiled or when it is installed.
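These values can be inspected like any other parameter; a minimal sketch using SHOW and the pg_settings view (preset parameters have the context 'internal'):

SHOW block_size;
SHOW server_version_num;

SELECT name, setting
FROM pg_settings
WHERE context = 'internal'
ORDER BY name;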
block_size
(integer
)
Reports the size of a disk block. It is determined by the value
of BLCKSZ
when building the server. The default
value is 8192 bytes. The meaning of some configuration
variables (such as shared_buffers) is
influenced by block_size
. See Section 20.4 for information.
data_checksums
(boolean
)
Reports whether data checksums are enabled for this cluster. See data checksums for more information.
data_directory_mode
(integer
)
On Unix systems this parameter reports the permissions the data
directory (defined by data_directory)
had at server startup.
(On Microsoft Windows this parameter will always display
0700
.) See
group access for more information.
debug_assertions
(boolean
)
Reports whether PostgreSQL has been built
with assertions enabled. That is the case if the
macro USE_ASSERT_CHECKING
is defined
when PostgreSQL is built (accomplished
e.g., by the configure
option
--enable-cassert
). By
default PostgreSQL is built without
assertions.
integer_datetimes
(boolean
)
Reports whether PostgreSQL was built with support for
64-bit-integer dates and times. As of PostgreSQL 10,
this is always on
.
in_hot_standby
(boolean
)
Reports whether the server is currently in hot standby mode. When
this is on
, all transactions are forced to be
read-only. Within a session, this can change only if the server is
promoted to be primary. See Section 27.4 for more
information.
lc_collate
(string
)
Reports the locale in which sorting of textual data is done. See Section 24.1 for more information. This value is determined when a database is created.
lc_ctype
(string
)
Reports the locale that determines character classifications.
See Section 24.1 for more information.
This value is determined when a database is created.
Ordinarily this will be the same as lc_collate
,
but for special applications it might be set differently.
max_function_args
(integer
)
Reports the maximum number of function arguments. It is determined by
the value of FUNC_MAX_ARGS
when building the server. The
default value is 100 arguments.
max_identifier_length
(integer
)
Reports the maximum identifier length. It is determined as one
less than the value of NAMEDATALEN
when building
the server. The default value of NAMEDATALEN
is
64; therefore the default
max_identifier_length
is 63 bytes, which
can be less than 63 characters when using multibyte encodings.
max_index_keys
(integer
)
Reports the maximum number of index keys. It is determined by
the value of INDEX_MAX_KEYS
when building the server. The
default value is 32 keys.
segment_size
(integer
)
Reports the number of blocks (pages) that can be stored within a file
segment. It is determined by the value of RELSEG_SIZE
when building the server. The maximum size of a segment file in bytes
is equal to segment_size
multiplied by
block_size
; by default this is 1GB.
server_encoding
(string
)
Reports the database encoding (character set). It is determined when the database is created. Ordinarily, clients need only be concerned with the value of client_encoding.
server_version
(string
)
Reports the version number of the server. It is determined by the
value of PG_VERSION
when building the server.
server_version_num
(integer
)
Reports the version number of the server as an integer. It is determined
by the value of PG_VERSION_NUM
when building the server.
ssl_library
(string
)
Reports the name of the SSL library that this
PostgreSQL server was built with (even if
SSL is not currently configured or in use on this instance), for
example OpenSSL
, or an empty string if none.
wal_block_size
(integer
)
Reports the size of a WAL disk block. It is determined by the value
of XLOG_BLCKSZ
when building the server. The default value
is 8192 bytes.
wal_segment_size
(integer
)
Reports the size of write ahead log segments. The default value is 16MB. See Section 30.5 for more information.
Customized options were designed to allow parameters not normally known to PostgreSQL to be added by add-on modules (such as procedural languages). This allows extension modules to be configured in the standard ways.
Custom options have two-part names: an extension name, then a dot, then
the parameter name proper, much like qualified names in SQL. An example
is plpgsql.variable_conflict
.
Because custom options may need to be set in processes that have not loaded the relevant extension module, PostgreSQL will accept a setting for any two-part parameter name. Such variables are treated as placeholders and have no function until the module that defines them is loaded. When an extension module is loaded, it will add its variable definitions, convert any placeholder values according to those definitions, and issue warnings for any unrecognized placeholders that begin with its extension name.
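For example (my_extension.batch_size is a hypothetical placeholder; plpgsql.variable_conflict is the real option mentioned above):

SET my_extension.batch_size = '500';       -- accepted as a placeholder even before any module defines it
SHOW my_extension.batch_size;

SET plpgsql.variable_conflict = 'use_column';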
The following parameters are intended for developer testing, and
should never be used on a production database. However, some of
them can be used to assist with the recovery of severely damaged
databases. As such, they have been excluded from the sample
postgresql.conf
file. Note that many of these
parameters require special source compilation flags to work at all.
allow_in_place_tablespaces
(boolean
)
Allows tablespaces to be created as directories inside
pg_tblspc
, when an empty location string
is provided to the CREATE TABLESPACE
command. This
is intended to allow testing replication scenarios where primary and
standby servers are running on the same machine. Such directories
are likely to confuse backup tools that expect to find only symbolic
links in that location. Only superusers can change this setting.
allow_system_table_mods
(boolean
)
Allows modification of the structure of system tables as well as certain other risky actions on system tables. This is otherwise not allowed even for superusers. Ill-advised use of this setting can cause irretrievable data loss or seriously corrupt the database system. Only superusers can change this setting.
backtrace_functions
(string
)
This parameter contains a comma-separated list of C function names. If an error is raised and the name of the internal C function where the error happens matches a value in the list, then a backtrace is written to the server log together with the error message. This can be used to debug specific areas of the source code.
Backtrace support is not available on all platforms, and the quality of the backtraces depends on compilation options.
This parameter can only be set by superusers.
debug_discard_caches
(integer
)
When set to 1
, each system catalog cache entry is
invalidated at the first possible opportunity, whether or not
anything that would render it invalid really occurred. Caching of
system catalogs is effectively disabled as a result, so the server
will run extremely slowly. Higher values run the cache invalidation
recursively, which is even slower and only useful for testing
the caching logic itself. The default value of 0
selects normal catalog caching behavior.
This parameter can be very helpful when trying to trigger
hard-to-reproduce bugs involving concurrent catalog changes, but it
is otherwise rarely needed. See the source code files
inval.c
and
pg_config_manual.h
for details.
This parameter is supported when
DISCARD_CACHES_ENABLED
was defined at compile time
(which happens automatically when using the
configure option
--enable-cassert
). In production builds, its value
will always be 0
and attempts to set it to another
value will raise an error.
force_parallel_mode
(enum
)
Allows the use of parallel queries for testing purposes even in cases
where no performance benefit is expected.
The allowed values of force_parallel_mode
are
off
(use parallel mode only when it is expected to improve
performance), on
(force parallel query for all queries
for which it is thought to be safe), and regress
(like
on
, but with additional behavior changes as explained
below).
More specifically, setting this value to on
will add
a Gather
node to the top of any query plan for which this
appears to be safe, so that the query runs inside of a parallel worker.
Even when a parallel worker is not available or cannot be used,
operations such as starting a subtransaction that would be prohibited
in a parallel query context will be prohibited unless the planner
believes that this will cause the query to fail. If failures or
unexpected results occur when this option is set, some functions used
by the query may need to be marked PARALLEL UNSAFE
(or, possibly, PARALLEL RESTRICTED
).
Setting this value to regress
has all of the same effects
as setting it to on
plus some additional effects that are
intended to facilitate automated regression testing. Normally,
messages from a parallel worker include a context line indicating that,
but a setting of regress
suppresses this line so that the
output is the same as in non-parallel execution. Also,
the Gather
nodes added to plans by this setting are hidden
in EXPLAIN
output so that the output matches what
would be obtained if this setting were turned off
.
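A quick way to observe the effect in a test session (the plan shown in the comments is typical, not guaranteed):

SET force_parallel_mode = on;
EXPLAIN (COSTS OFF) SELECT 1;
-- Gather
--   Workers Planned: 1
--   Single Copy: true
--   ->  Result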
ignore_system_indexes
(boolean
)
Ignore system indexes when reading system tables (but still update the indexes when modifying the tables). This is useful when recovering from damaged system indexes. This parameter cannot be changed after session start.
post_auth_delay
(integer
)
The amount of time to delay when a new server process is started, after it conducts the authentication procedure. This is intended to give developers an opportunity to attach to the server process with a debugger. If this value is specified without units, it is taken as seconds. A value of zero (the default) disables the delay. This parameter cannot be changed after session start.
pre_auth_delay
(integer
)
The amount of time to delay just after a
new server process is forked, before it conducts the
authentication procedure. This is intended to give developers an
opportunity to attach to the server process with a debugger to
trace down misbehavior in authentication.
If this value is specified without units, it is taken as seconds.
A value of zero (the default) disables the delay.
This parameter can only be set in the postgresql.conf
file or on the server command line.
trace_notify
(boolean
)
Generates a great amount of debugging output for the
LISTEN
and NOTIFY
commands. client_min_messages or
log_min_messages must be
DEBUG1
or lower to send this output to the
client or server logs, respectively.
trace_recovery_messages
(enum
)
Enables logging of recovery-related debugging output that otherwise
would not be logged. This parameter allows the user to override the
normal setting of log_min_messages, but only for
specific messages. This is intended for use in debugging Hot Standby.
Valid values are DEBUG5
, DEBUG4
,
DEBUG3
, DEBUG2
, DEBUG1
, and
LOG
. The default, LOG
, does not affect
logging decisions at all. The other values cause recovery-related
debug messages of that priority or higher to be logged as though they
had LOG
priority; for common settings of
log_min_messages
this results in unconditionally sending
them to the server log.
This parameter can only be set in the postgresql.conf
file or on the server command line.
trace_sort
(boolean
)
If on, emit information about resource usage during sort operations.
This parameter is only available if the TRACE_SORT
macro
was defined when PostgreSQL was compiled.
(However, TRACE_SORT
is currently defined by default.)
trace_locks
(boolean
)
If on, emit information about lock usage. Information dumped includes the type of lock operation, the type of lock and the unique identifier of the object being locked or unlocked. Also included are bit masks for the lock types already granted on this object as well as for the lock types awaited on this object. For each lock type a count of the number of granted locks and waiting locks is also dumped as well as the totals. An example of the log file output is shown here:
LOG:  LockAcquire: new: lock(0xb7acd844) id(24688,24696,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(AccessShareLock)
LOG:  GrantLock: lock(0xb7acd844) id(24688,24696,0,0,0,1) grantMask(2) req(1,0,0,0,0,0,0)=1 grant(1,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)
LOG:  UnGrantLock: updated: lock(0xb7acd844) id(24688,24696,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(AccessShareLock)
LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(INVALID)
Details of the structure being dumped may be found in
src/include/storage/lock.h
.
This parameter is only available if the LOCK_DEBUG
macro was defined when PostgreSQL was
compiled.
trace_lwlocks
(boolean
)
If on, emit information about lightweight lock usage. Lightweight locks are intended primarily to provide mutual exclusion of access to shared-memory data structures.
This parameter is only available if the LOCK_DEBUG
macro was defined when PostgreSQL was
compiled.
trace_userlocks
(boolean
)
If on, emit information about user lock usage. Output is the same
as for trace_locks
, only for advisory locks.
This parameter is only available if the LOCK_DEBUG
macro was defined when PostgreSQL was
compiled.
trace_lock_oidmin
(integer
)
If set, do not trace locks for tables below this OID (used to avoid output on system tables).
This parameter is only available if the LOCK_DEBUG
macro was defined when PostgreSQL was
compiled.
trace_lock_table
(integer
)
Unconditionally trace locks on this table (OID).
This parameter is only available if the LOCK_DEBUG
macro was defined when PostgreSQL was
compiled.
debug_deadlocks
(boolean
)
If set, dumps information about all current locks when a deadlock timeout occurs.
This parameter is only available if the LOCK_DEBUG
macro was defined when PostgreSQL was
compiled.
log_btree_build_stats
(boolean
)
If set, logs system resource usage statistics (memory and CPU) on various B-tree operations.
This parameter is only available if the BTREE_BUILD_STATS
macro was defined when PostgreSQL was
compiled.
wal_consistency_checking
(string
)
This parameter is intended to be used to check for bugs in the WAL redo routines. When enabled, full-page images of any buffers modified in conjunction with the WAL record are added to the record. If the record is subsequently replayed, the system will first apply each record and then test whether the buffers modified by the record match the stored images. In certain cases (such as hint bits), minor variations are acceptable, and will be ignored. Any unexpected differences will result in a fatal error, terminating recovery.
The default value of this setting is the empty string, which disables
the feature. It can be set to all
to check all
records, or to a comma-separated list of resource managers to check
only records originating from those resource managers. Currently,
the supported resource managers are heap
,
heap2
, btree
, hash
,
gin
, gist
, sequence
,
spgist
, brin
, and generic
. Only
superusers can change this setting.
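For example, a test server might enable checking for a couple of resource managers in postgresql.conf (this adds full-page images to many WAL records, so it is unsuitable for production):

wal_consistency_checking = 'heap,btree'
# or, to check every record:
wal_consistency_checking = 'all'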
wal_debug
(boolean
)
If on, emit WAL-related debugging output. This parameter is
only available if the WAL_DEBUG
macro was
defined when PostgreSQL was
compiled.
ignore_checksum_failure
(boolean
)
Only has effect if data checksums are enabled.
Detection of a checksum failure during a read normally causes
PostgreSQL to report an error, aborting the current
transaction. Setting ignore_checksum_failure
to on causes
the system to ignore the failure (but still report a warning), and
continue processing. This behavior may cause crashes, propagate
or hide corruption, or other serious problems. However, it may allow
you to get past the error and retrieve undamaged tuples that might still be
present in the table if the block header is still sane. If the header is
corrupt an error will be reported even if this option is enabled. The
default setting is off
, and it can only be changed by a superuser.
zero_damaged_pages
(boolean
)
Detection of a damaged page header normally causes
PostgreSQL to report an error, aborting the current
transaction. Setting zero_damaged_pages
to on causes
the system to instead report a warning, zero out the damaged
page in memory, and continue processing. This behavior will destroy data,
namely all the rows on the damaged page. However, it does allow you to get
past the error and retrieve rows from any undamaged pages that might
be present in the table. It is useful for recovering data if
corruption has occurred due to a hardware or software error. You should
generally not set this on until you have given up hope of recovering
data from the damaged pages of a table. Zeroed-out pages are not
forced to disk so it is recommended to recreate the table or
the index before turning this parameter off again. The
default setting is off
, and it can only be changed
by a superuser.
ignore_invalid_pages
(boolean
)
If set to off
(the default), detection of
WAL records having references to invalid pages during
recovery causes PostgreSQL to
raise a PANIC-level error, aborting the recovery. Setting
ignore_invalid_pages
to on
causes the system to ignore invalid page references in WAL records
(but still report a warning), and continue the recovery.
This behavior may cause crashes, data loss,
propagate or hide corruption, or other serious problems.
However, it may allow you to get past the PANIC-level error,
to finish the recovery, and to cause the server to start up.
The parameter can only be set at server start. It only has effect
during recovery or in standby mode.
jit_debugging_support
(boolean
)
If LLVM has the required functionality, register generated functions
with GDB. This makes debugging easier.
The default setting is off
.
This parameter can only be set at server start.
jit_dump_bitcode
(boolean
)
Writes the generated LLVM IR out to the
file system, inside data_directory. This is only
useful for working on the internals of the JIT implementation.
The default setting is off
.
This parameter can only be changed by a superuser.
jit_expressions
(boolean
)
Determines whether expressions are JIT compiled, when JIT compilation
is activated (see Section 32.2). The default is
on
.
jit_profiling_support
(boolean
)
If LLVM has the required functionality, emit the data needed to allow
perf to profile functions generated by JIT.
This writes out files to ~/.debug/jit/
; the
user is responsible for performing cleanup when desired.
The default setting is off
.
This parameter can only be set at server start.
jit_tuple_deforming
(boolean
)
Determines whether tuple deforming is JIT compiled, when JIT
compilation is activated (see Section 32.2).
The default is on
.
remove_temp_files_after_crash
(boolean
)
When set to on
, which is the default,
PostgreSQL will automatically remove
temporary files after a backend crash. If disabled, the files will be
retained and may be used for debugging, for example. Repeated crashes
may however result in accumulation of useless files. This parameter
can only be set in the postgresql.conf
file or on
the server command line.
For convenience there are also single letter command-line option switches available for some parameters. They are described in Table 20.3. Some of these options exist for historical reasons, and their presence as a single-letter option does not necessarily indicate an endorsement to use the option heavily.
Table 20.3. Short Option Key
Short Option | Equivalent |
---|---|
-B x | shared_buffers = x |
-d x | log_min_messages = DEBUGx |
-e | datestyle = euro |
-fb, -fh, -fi, -fm, -fn, -fo, -fs, -ft | enable_bitmapscan = off, enable_hashjoin = off, enable_indexscan = off, enable_mergejoin = off, enable_nestloop = off, enable_indexonlyscan = off, enable_seqscan = off, enable_tidscan = off |
-F | fsync = off |
-h x | listen_addresses = x |
-i | listen_addresses = '*' |
-k x | unix_socket_directories = x |
-l | ssl = on |
-N x | max_connections = x |
-O | allow_system_table_mods = on |
-p x | port = x |
-P | ignore_system_indexes = on |
-s | log_statement_stats = on |
-S x | work_mem = x |
-tpa, -tpl, -te | log_parser_stats = on, log_planner_stats = on, log_executor_stats = on |
-W x | post_auth_delay = x |
Table of Contents
The pg_hba.conf File
When a client application connects to the database server, it specifies which PostgreSQL database user name it wants to connect as, much the same way one logs into a Unix computer as a particular user. Within the SQL environment the active database user name determines access privileges to database objects — see Chapter 22 for more information. Therefore, it is essential to restrict which database users can connect.
As explained in Chapter 22,
PostgreSQL actually does privilege
management in terms of “roles”. In this chapter, we
consistently use database user to mean “role with the
LOGIN
privilege”.
Authentication is the process by which the database server establishes the identity of the client, and by extension determines whether the client application (or the user who runs the client application) is permitted to connect with the database user name that was requested.
PostgreSQL offers a number of different client authentication methods. The method used to authenticate a particular client connection can be selected on the basis of (client) host address, database, and user.
PostgreSQL database user names are logically separate from user names of the operating system in which the server runs. If all the users of a particular server also have accounts on the server's machine, it makes sense to assign database user names that match their operating system user names. However, a server that accepts remote connections might have many database users who have no local operating system account, and in such cases there need be no connection between database user names and OS user names.
The pg_hba.conf File
Client authentication is controlled by a configuration file,
which traditionally is named
pg_hba.conf
and is stored in the database
cluster's data directory.
(HBA stands for host-based authentication.) A default
pg_hba.conf
file is installed when the data
directory is initialized by initdb
. It is
possible to place the authentication configuration file elsewhere,
however; see the hba_file configuration parameter.
The general format of the pg_hba.conf
file is
a set of records, one per line. Blank lines are ignored, as is any
text after the #
comment character.
A record can be continued onto the next line by ending the line with
a backslash. (Backslashes are not special except at the end of a line.)
A record is made
up of a number of fields which are separated by spaces and/or tabs.
Fields can contain white space if the field value is double-quoted.
Quoting one of the keywords in a database, user, or address field (e.g.,
all
or replication
) makes the word lose its special
meaning, and just match a database, user, or host with that name.
Backslash line continuation applies even within quoted text or comments.
Each record specifies a connection type, a client IP address range (if relevant for the connection type), a database name, a user name, and the authentication method to be used for connections matching these parameters. The first record with a matching connection type, client address, requested database, and user name is used to perform authentication. There is no “fall-through” or “backup”: if one record is chosen and the authentication fails, subsequent records are not considered. If no record matches, access is denied.
A record can have several formats:
local         database  user  auth-method  [auth-options]
host          database  user  address  auth-method  [auth-options]
hostssl       database  user  address  auth-method  [auth-options]
hostnossl     database  user  address  auth-method  [auth-options]
hostgssenc    database  user  address  auth-method  [auth-options]
hostnogssenc  database  user  address  auth-method  [auth-options]
host          database  user  IP-address  IP-mask  auth-method  [auth-options]
hostssl       database  user  IP-address  IP-mask  auth-method  [auth-options]
hostnossl     database  user  IP-address  IP-mask  auth-method  [auth-options]
hostgssenc    database  user  IP-address  IP-mask  auth-method  [auth-options]
hostnogssenc  database  user  IP-address  IP-mask  auth-method  [auth-options]
The meaning of the fields is as follows:
local
This record matches connection attempts using Unix-domain sockets. Without a record of this type, Unix-domain socket connections are disallowed.
host
This record matches connection attempts made using TCP/IP.
host
records match
SSL or non-SSL connection
attempts as well as GSSAPI encrypted or
non-GSSAPI encrypted connection attempts.
Remote TCP/IP connections will not be possible unless
the server is started with an appropriate value for the
listen_addresses configuration parameter,
since the default behavior is to listen for TCP/IP connections
only on the local loopback address localhost
.
hostssl
This record matches connection attempts made using TCP/IP, but only when the connection is made with SSL encryption.
To make use of this option the server must be built with
SSL support. Furthermore,
SSL must be enabled
by setting the ssl configuration parameter (see
Section 19.9 for more information).
Otherwise, the hostssl
record is ignored except for
logging a warning that it cannot match any connections.
hostnossl
This record type has the opposite behavior of hostssl
;
it only matches connection attempts made over
TCP/IP that do not use SSL.
hostgssenc
This record matches connection attempts made using TCP/IP, but only when the connection is made with GSSAPI encryption.
To make use of this option the server must be built with
GSSAPI support. Otherwise,
the hostgssenc
record is ignored except for logging
a warning that it cannot match any connections.
hostnogssenc
This record type has the opposite behavior of hostgssenc
;
it only matches connection attempts made over
TCP/IP that do not use GSSAPI encryption.
database
Specifies which database name(s) this record matches. The value
all
specifies that it matches all databases.
The value sameuser
specifies that the record
matches if the requested database has the same name as the
requested user. The value samerole
specifies that
the requested user must be a member of the role with the same
name as the requested database. (samegroup
is an
obsolete but still accepted spelling of samerole
.)
Superusers are not considered to be members of a role for the
purposes of samerole
unless they are explicitly
members of the role, directly or indirectly, and not just by
virtue of being a superuser.
The value replication
specifies that the record
matches if a physical replication connection is requested, however, it
doesn't match with logical replication connections. Note that physical
replication connections do not specify any particular database whereas
logical replication connections do specify it.
Otherwise, this is the name of
a specific PostgreSQL database.
Multiple database names can be supplied by separating them with
commas. A separate file containing database names can be specified by
preceding the file name with @
.
user
Specifies which database user name(s) this record
matches. The value all
specifies that it
matches all users. Otherwise, this is either the name of a specific
database user, or a group name preceded by +
.
(Recall that there is no real distinction between users and groups
in PostgreSQL; a +
mark really means
“match any of the roles that are directly or indirectly members
of this role”, while a name without a +
mark matches
only that specific role.) For this purpose, a superuser is only
considered to be a member of a role if they are explicitly a member
of the role, directly or indirectly, and not just by virtue of
being a superuser.
Multiple user names can be supplied by separating them with commas.
A separate file containing user names can be specified by preceding the
file name with @
.
address
Specifies the client machine address(es) that this record matches. This field can contain either a host name, an IP address range, or one of the special key words mentioned below.
An IP address range is specified using standard numeric notation
for the range's starting address, then a slash (/
)
and a CIDR mask length. The mask
length indicates the number of high-order bits of the client
IP address that must match. Bits to the right of this should
be zero in the given IP address.
There must not be any white space between the IP address, the
/
, and the CIDR mask length.
Typical examples of an IPv4 address range specified this way are
172.20.143.89/32
for a single host, or
172.20.143.0/24
for a small network, or
10.6.0.0/16
for a larger one.
An IPv6 address range might look like ::1/128
for a single host (in this case the IPv6 loopback address) or
fe80::7a31:c1ff:0000:0000/96
for a small
network.
0.0.0.0/0
represents all
IPv4 addresses, and ::0/0
represents
all IPv6 addresses.
To specify a single host, use a mask length of 32 for IPv4 or
128 for IPv6. In a network address, do not omit trailing zeroes.
An entry given in IPv4 format will match only IPv4 connections, and an entry given in IPv6 format will match only IPv6 connections, even if the represented address is in the IPv4-in-IPv6 range. Note that entries in IPv6 format will be rejected if the system's C library does not have support for IPv6 addresses.
You can also write all
to match any IP address,
samehost
to match any of the server's own IP
addresses, or samenet
to match any address in any
subnet that the server is directly connected to.
If a host name is specified (anything that is not an IP address
range or a special key word is treated as a host name),
that name is compared with the result of a reverse name
resolution of the client's IP address (e.g., reverse DNS
lookup, if DNS is used). Host name comparisons are case
insensitive. If there is a match, then a forward name
resolution (e.g., forward DNS lookup) is performed on the host
name to check whether any of the addresses it resolves to are
equal to the client's IP address. If both directions match,
then the entry is considered to match. (The host name that is
used in pg_hba.conf
should be the one that
address-to-name resolution of the client's IP address returns,
otherwise the line won't be matched. Some host name databases
allow associating an IP address with multiple host names, but
the operating system will only return one host name when asked
to resolve an IP address.)
A host name specification that starts with a dot
(.
) matches a suffix of the actual host
name. So .example.com
would match
foo.example.com
(but not just
example.com
).
When host names are specified
in pg_hba.conf
, you should make sure that
name resolution is reasonably fast. It can be of advantage to
set up a local name resolution cache such
as nscd
. Also, you may wish to enable the
configuration parameter log_hostname
to see
the client's host name instead of the IP address in the log.
These fields do not apply to local
records.
Users sometimes wonder why host names are handled
in this seemingly complicated way, with two name resolutions
including a reverse lookup of the client's IP address. This
complicates use of the feature in case the client's reverse DNS
entry is not set up or yields some undesirable host name.
It is done primarily for efficiency: this way, a connection attempt
requires at most two resolver lookups, one reverse and one forward.
If there is a resolver problem with some address, it becomes only
that client's problem. A hypothetical alternative
implementation that only did forward lookups would have to
resolve every host name mentioned in
pg_hba.conf
during every connection attempt.
That could be quite slow if many names are listed.
And if there is a resolver problem with one of the host names,
it becomes everyone's problem.
Also, a reverse lookup is necessary to implement the suffix matching feature, because the actual client host name needs to be known in order to match it against the pattern.
Note that this behavior is consistent with other popular implementations of host name-based access control, such as the Apache HTTP Server and TCP Wrappers.
IP-address
IP-mask
These two fields can be used as an alternative to the
IP-address
/
mask-length
notation. Instead of
specifying the mask length, the actual mask is specified in a
separate column. For example, 255.0.0.0
represents an IPv4
CIDR mask length of 8, and 255.255.255.255
represents a
CIDR mask length of 32.
These fields do not apply to local
records.
auth-method
Specifies the authentication method to use when a connection matches
this record. The possible choices are summarized here; details
are in Section 21.3. All the options
are lower case and treated case sensitively, so even acronyms like
ldap
must be specified as lower case.
trust
Allow the connection unconditionally. This method allows anyone that can connect to the PostgreSQL database server to login as any PostgreSQL user they wish, without the need for a password or any other authentication. See Section 21.4 for details.
reject
Reject the connection unconditionally. This is useful for
“filtering out” certain hosts from a group, for example a
reject
line could block a specific host from connecting,
while a later line allows the remaining hosts in a specific
network to connect.
scram-sha-256
Perform SCRAM-SHA-256 authentication to verify the user's password. See Section 21.5 for details.
md5
Perform SCRAM-SHA-256 or MD5 authentication to verify the user's password. See Section 21.5 for details.
password
Require the client to supply an unencrypted password for authentication. Since the password is sent in clear text over the network, this should not be used on untrusted networks. See Section 21.5 for details.
gss
Use GSSAPI to authenticate the user. This is only available for TCP/IP connections. See Section 21.6 for details. It can be used in conjunction with GSSAPI encryption.
sspi
Use SSPI to authenticate the user. This is only available on Windows. See Section 21.7 for details.
ident
Obtain the operating system user name of the client by contacting the ident server on the client and check if it matches the requested database user name. Ident authentication can only be used on TCP/IP connections. When specified for local connections, peer authentication will be used instead. See Section 21.8 for details.
peer
Obtain the client's operating system user name from the operating system and check if it matches the requested database user name. This is only available for local connections. See Section 21.9 for details.
ldap
Authenticate using an LDAP server. See Section 21.10 for details.
radius
Authenticate using a RADIUS server. See Section 21.11 for details.
cert
Authenticate using SSL client certificates. See Section 21.12 for details.
pam
Authenticate using the Pluggable Authentication Modules (PAM) service provided by the operating system. See Section 21.13 for details.
bsd
Authenticate using the BSD Authentication service provided by the operating system. See Section 21.14 for details.
auth-options
After the auth-method
field, there can be field(s) of
the form name
=
value
that
specify options for the authentication method. Details about which
options are available for which authentication methods appear below.
In addition to the method-specific options listed below, there is a
method-independent authentication option clientcert
, which
can be specified in any hostssl
record.
This option can be set to verify-ca
or
verify-full
. Both options require the client
to present a valid (trusted) SSL certificate, while
verify-full
additionally enforces that the
cn
(Common Name) in the certificate matches
the username or an applicable mapping.
This behavior is similar to the cert
authentication
method (see Section 21.12) but enables pairing
the verification of client certificates with any authentication
method that supports hostssl
entries.
On any record using client certificate authentication (i.e. one
using the cert
authentication method or one
using the clientcert
option), you can specify
which part of the client certificate credentials to match using
the clientname
option. This option can have one
of two values. If you specify clientname=CN
, which
is the default, the username is matched against the certificate's
Common Name (CN)
. If instead you specify
clientname=DN
the username is matched against the
entire Distinguished Name (DN)
of the certificate.
This option is probably best used in conjunction with a username map.
The comparison is done with the DN
in
RFC 2253
format. To see the DN
of a client certificate
in this format, do
openssl x509 -in myclient.crt -noout -subject -nameopt RFC2253 | sed "s/^subject=//"
Care needs to be taken when using this option, especially when using
regular expression matching against the DN
.
Files included by @
constructs are read as lists of names,
which can be separated by either whitespace or commas. Comments are
introduced by #
, just as in
pg_hba.conf
, and nested @
constructs are
allowed. Unless the file name following @
is an absolute
path, it is taken to be relative to the directory containing the
referencing file.
Since the pg_hba.conf
records are examined
sequentially for each connection attempt, the order of the records is
significant. Typically, earlier records will have tight connection
match parameters and weaker authentication methods, while later
records will have looser match parameters and stronger authentication
methods. For example, one might wish to use trust
authentication for local TCP/IP connections but require a password for
remote TCP/IP connections. In this case a record specifying
trust
authentication for connections from 127.0.0.1 would
appear before a record specifying password authentication for a wider
range of allowed client IP addresses.
The pg_hba.conf
file is read on start-up and when
the main server process receives a
SIGHUP
signal. If you edit the file on an
active system, you will need to signal the postmaster
(using pg_ctl reload
, calling the SQL function
pg_reload_conf()
, or using kill
-HUP
) to make it re-read the file.
The preceding statement is not true on Microsoft Windows: there, any
changes in the pg_hba.conf
file are immediately
applied by subsequent new connections.
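For example, any of the following re-reads the configuration files (run as the operating system user owning the data directory, or as a database superuser for the SQL variant):

pg_ctl reload -D "$PGDATA"                          # shell
kill -HUP "$(head -1 "$PGDATA/postmaster.pid")"     # shell, equivalent
SELECT pg_reload_conf();                            -- SQL, superuser by default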
The system view
pg_hba_file_rules
can be helpful for pre-testing changes to the pg_hba.conf
file, or for diagnosing problems if loading of the file did not have the
desired effects. Rows in the view with
non-null error
fields indicate problems in the
corresponding lines of the file.
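A minimal sketch of such a check:

SELECT line_number, error
FROM pg_hba_file_rules
WHERE error IS NOT NULL;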
To connect to a particular database, a user must not only pass the
pg_hba.conf
checks, but must have the
CONNECT
privilege for the database. If you wish to
restrict which users can connect to which databases, it's usually
easier to control this by granting/revoking CONNECT
privilege
than to put the rules in pg_hba.conf
entries.
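A short sketch (the database appdb and role app_user are hypothetical):

REVOKE CONNECT ON DATABASE appdb FROM PUBLIC;   -- databases grant CONNECT to PUBLIC by default
GRANT CONNECT ON DATABASE appdb TO app_user;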
Some examples of pg_hba.conf
entries are shown in
Example 21.1. See the next section for details on the
different authentication methods.
Example 21.1. Example pg_hba.conf Entries
# Allow any user on the local system to connect to any database with
# any database user name using Unix-domain sockets (the default for local
# connections).
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   all             all                                     trust

# The same using local loopback TCP/IP connections.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    all             all             127.0.0.1/32            trust

# The same as the previous line, but using a separate netmask column
#
# TYPE  DATABASE        USER            IP-ADDRESS      IP-MASK             METHOD
host    all             all             127.0.0.1       255.255.255.255     trust

# The same over IPv6.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    all             all             ::1/128                 trust

# The same using a host name (would typically cover both IPv4 and IPv6).
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    all             all             localhost               trust

# Allow any user from any host with IP address 192.168.93.x to connect
# to database "postgres" as the same user name that ident reports for
# the connection (typically the operating system user name).
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    postgres        all             192.168.93.0/24         ident

# Allow any user from host 192.168.12.10 to connect to database
# "postgres" if the user's password is correctly supplied.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    postgres        all             192.168.12.10/32        scram-sha-256

# Allow any user from hosts in the example.com domain to connect to
# any database if the user's password is correctly supplied.
#
# Require SCRAM authentication for most users, but make an exception
# for user 'mike', who uses an older client that doesn't support SCRAM
# authentication.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    all             mike            .example.com            md5
host    all             all             .example.com            scram-sha-256

# In the absence of preceding "host" lines, these three lines will
# reject all connections from 192.168.54.1 (since that entry will be
# matched first), but allow GSSAPI-encrypted connections from anywhere else
# on the Internet.  The zero mask causes no bits of the host IP address to
# be considered, so it matches any host.  Unencrypted GSSAPI connections
# (which "fall through" to the third line since "hostgssenc" only matches
# encrypted GSSAPI connections) are allowed, but only from 192.168.12.10.
#
# TYPE       DATABASE   USER            ADDRESS                 METHOD
host         all        all             192.168.54.1/32         reject
hostgssenc   all        all             0.0.0.0/0               gss
host         all        all             192.168.12.10/32        gss

# Allow users from 192.168.x.x hosts to connect to any database, if
# they pass the ident check.  If, for example, ident says the user is
# "bryanh" and he requests to connect as PostgreSQL user "guest1", the
# connection is allowed if there is an entry in pg_ident.conf for map
# "omicron" that says "bryanh" is allowed to connect as "guest1".
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    all             all             192.168.0.0/16          ident map=omicron

# If these are the only three lines for local connections, they will
# allow local users to connect only to their own databases (databases
# with the same name as their database user name) except for administrators
# and members of role "support", who can connect to all databases.  The file
# $PGDATA/admins contains a list of names of administrators.  Passwords
# are required in all cases.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
local   sameuser        all                                     md5
local   all             @admins                                 md5
local   all             +support                                md5

# The last two lines above can be combined into a single line:
local   all             @admins,+support                        md5

# The database column can also use lists and file names:
local   db1,db2,@demodbs  all                                   md5
When using an external authentication system such as Ident or GSSAPI,
the name of the operating system user that initiated the connection
might not be the same as the database user (role) that is to be used.
In this case, a user name map can be applied to map the operating system
user name to a database user. To use user name mapping, specify
map
=map-name
in the options field in pg_hba.conf
. This option is
supported for all authentication methods that receive external user names.
Since different mappings might be needed for different connections,
the name of the map to be used is specified in the
map-name
parameter in pg_hba.conf
to indicate which map to use for each individual connection.
User name maps are defined in the ident map file, which by default is named
pg_ident.conf
and is stored in the
cluster's data directory. (It is possible to place the map file
elsewhere, however; see the ident_file
configuration parameter.)
The ident map file contains lines of the general form:
map-name  system-username  database-username
Comments, whitespace and line continuations are handled in the same way as in
pg_hba.conf
. The
map-name
is an arbitrary name that will be used to
refer to this mapping in pg_hba.conf
. The other
two fields specify an operating system user name and a matching
database user name. The same map-name
can be
used repeatedly to specify multiple user-mappings within a single map.
There is no restriction regarding how many database users a given operating system user can correspond to, nor vice versa. Thus, entries in a map should be thought of as meaning “this operating system user is allowed to connect as this database user”, rather than implying that they are equivalent. The connection will be allowed if there is any map entry that pairs the user name obtained from the external authentication system with the database user name that the user has requested to connect as.
If the system-username
field starts with a slash (/
),
the remainder of the field is treated as a regular expression.
(See Section 9.7.3.1 for details of
PostgreSQL's regular expression syntax.) The regular
expression can include a single capture, or parenthesized subexpression,
which can then be referenced in the database-username
field as \1
(backslash-one). This allows the mapping of
multiple user names in a single line, which is particularly useful for
simple syntax substitutions. For example, these entries
mymap   /^(.*)@mydomain\.com$      \1
mymap   /^(.*)@otherdomain\.com$   guest
will remove the domain part for users with system user names that end with
@mydomain.com
, and allow any user whose system name ends with
@otherdomain.com
to log in as guest
.
Keep in mind that by default, a regular expression can match just part of
a string. It's usually wise to use ^
and $
, as
shown in the above example, to force the match to be to the entire
system user name.
The pg_ident.conf
file is read on start-up and
when the main server process receives a
SIGHUP
signal. If you edit the file on an
active system, you will need to signal the postmaster
(using pg_ctl reload
, calling the SQL function
pg_reload_conf()
, or using kill
-HUP
) to make it re-read the file.
A pg_ident.conf
file that could be used in
conjunction with the pg_hba.conf
file in Example 21.1 is shown in Example 21.2. In this example, anyone
logged in to a machine on the 192.168 network that does not have the
operating system user name bryanh
, ann
, or
robert
would not be granted access. Unix user
robert
would only be allowed access when he tries to
connect as PostgreSQL user bob
, not
as robert
or anyone else. ann
would
only be allowed to connect as ann
. User
bryanh
would be allowed to connect as either
bryanh
or as guest1
.
Example 21.2. An Example pg_ident.conf File
# MAPNAME       SYSTEM-USERNAME         PG-USERNAME
omicron         bryanh                  bryanh
omicron         ann                     ann
# bob has user name robert on these machines
omicron         robert                  bob
# bryanh can also connect as guest1
omicron         bryanh                  guest1
PostgreSQL provides various methods for authenticating users:
Trust authentication, which simply trusts that users are who they say they are.
Password authentication, which requires that users send a password.
GSSAPI authentication, which relies on a GSSAPI-compatible security library. Typically this is used to access an authentication server such as a Kerberos or Microsoft Active Directory server.
SSPI authentication, which uses a Windows-specific protocol similar to GSSAPI.
Ident authentication, which relies on an “Identification Protocol” (RFC 1413) service on the client's machine. (On local Unix-socket connections, this is treated as peer authentication.)
Peer authentication, which relies on operating system facilities to identify the process at the other end of a local connection. This is not supported for remote connections.
LDAP authentication, which relies on an LDAP authentication server.
RADIUS authentication, which relies on a RADIUS authentication server.
Certificate authentication, which requires an SSL connection and authenticates users by checking the SSL certificate they send.
PAM authentication, which relies on a PAM (Pluggable Authentication Modules) library.
BSD authentication, which relies on the BSD Authentication framework (currently available only on OpenBSD).
Peer authentication is usually recommendable for local connections, though trust authentication might be sufficient in some circumstances. Password authentication is the easiest choice for remote connections. All the other options require some kind of external security infrastructure (usually an authentication server or a certificate authority for issuing SSL certificates), or are platform-specific.
The following sections describe each of these authentication methods in more detail.
When trust
authentication is specified,
PostgreSQL assumes that anyone who can
connect to the server is authorized to access the database with
whatever database user name they specify (even superuser names).
Of course, restrictions made in the database
and
user
columns still apply.
This method should only be used when there is adequate
operating-system-level protection on connections to the server.
trust
authentication is appropriate and very
convenient for local connections on a single-user workstation. It
is usually not appropriate by itself on a multiuser
machine. However, you might be able to use trust
even
on a multiuser machine, if you restrict access to the server's
Unix-domain socket file using file-system permissions. To do this, set the
unix_socket_permissions
(and possibly
unix_socket_group
) configuration parameters as
described in Section 20.3. Or you
could set the unix_socket_directories
configuration parameter to place the socket file in a suitably
restricted directory.
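One possible postgresql.conf arrangement for this (the group name pgtrusted is hypothetical):

unix_socket_directories = '/var/run/postgresql'   # a directory with suitably restricted permissions
unix_socket_group = 'pgtrusted'                   # operating system group allowed to use the socket
unix_socket_permissions = 0770                    # no access for users outside that group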
Setting file-system permissions only helps for Unix-socket connections.
Local TCP/IP connections are not restricted by file-system permissions.
Therefore, if you want to use file-system permissions for local security,
remove the host ... 127.0.0.1 ...
line from
pg_hba.conf
, or change it to a
non-trust
authentication method.
trust
authentication is only suitable for TCP/IP connections
if you trust every user on every machine that is allowed to connect
to the server by the pg_hba.conf
lines that specify
trust
. It is seldom reasonable to use trust
for any TCP/IP connections other than those from localhost (127.0.0.1).
There are several password-based authentication methods. These methods operate similarly but differ in how the users' passwords are stored on the server and how the password provided by a client is sent across the connection.
scram-sha-256
The method scram-sha-256
performs SCRAM-SHA-256
authentication, as described in
RFC 7677. It
is a challenge-response scheme that prevents password sniffing on
untrusted connections and supports storing passwords on the server in a
cryptographically hashed form that is thought to be secure.
This is the most secure of the currently provided methods, but it is not supported by older client libraries.
md5
The method md5
uses a custom less secure challenge-response
mechanism. It prevents password sniffing and avoids storing passwords
on the server in plain text but provides no protection if an attacker
manages to steal the password hash from the server. Also, the MD5 hash
algorithm is nowadays no longer considered secure against determined
attacks.
The md5
method cannot be used with
the db_user_namespace feature.
To ease transition from the md5
method to the newer
SCRAM method, if md5
is specified as a method
in pg_hba.conf
but the user's password on the
server is encrypted for SCRAM (see below), then SCRAM-based
authentication will automatically be chosen instead.
password
The method password
sends the password in clear-text and is
therefore vulnerable to password “sniffing” attacks. It should
always be avoided if possible. If the connection is protected by SSL
encryption then password
can be used safely, though.
(Though SSL certificate authentication might be a better choice if one
is depending on using SSL).
PostgreSQL database passwords are
separate from operating system user passwords. The password for
each database user is stored in the pg_authid
system
catalog. Passwords can be managed with the SQL commands
CREATE ROLE and
ALTER ROLE,
e.g., CREATE ROLE foo WITH LOGIN PASSWORD 'secret'
,
or the psql
command \password
.
If no password has been set up for a user, the stored password
is null and password authentication will always fail for that user.
The availability of the different password-based authentication methods
depends on how a user's password on the server is encrypted (or hashed,
more accurately). This is controlled by the configuration
parameter password_encryption at the time the
password is set. If a password was encrypted using
the scram-sha-256
setting, then it can be used for the
authentication methods scram-sha-256
and password
(but password transmission will be in
plain text in the latter case). The authentication method
specification md5
will automatically switch to using
the scram-sha-256
method in this case, as explained
above, so it will also work. If a password was encrypted using
the md5
setting, then it can be used only for
the md5
and password
authentication
method specifications (again, with the password transmitted in plain text
in the latter case). (Previous PostgreSQL releases supported storing the
password on the server in plain text. This is no longer possible.) To
check the currently stored password hashes, see the system
catalog pg_authid
.
To upgrade an existing installation from md5
to scram-sha-256
, after having ensured that all client
libraries in use are new enough to support SCRAM,
set password_encryption = 'scram-sha-256'
in postgresql.conf
, make all users set new passwords,
and change the authentication method specifications
in pg_hba.conf
to scram-sha-256
.
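A hedged outline of that procedure (the role alice is hypothetical):

-- Find roles still using MD5 password hashes:
SELECT rolname FROM pg_authid WHERE rolpassword LIKE 'md5%';

-- In postgresql.conf (then reload):
--   password_encryption = 'scram-sha-256'

-- Have each affected user set a new password, e.g.:
ALTER ROLE alice PASSWORD 'new-password';   -- or use \password alice in psql to avoid plain text in logs

-- Finally, switch the relevant pg_hba.conf lines from md5 to scram-sha-256.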
GSSAPI is an industry-standard protocol for secure authentication defined in RFC 2743. PostgreSQL supports GSSAPI for authentication, communications encryption, or both. GSSAPI provides automatic authentication (single sign-on) for systems that support it. The authentication itself is secure. If GSSAPI encryption or SSL encryption is used, the data sent along the database connection will be encrypted; otherwise, it will not.
GSSAPI support has to be enabled when PostgreSQL is built; see Chapter 17 for more information.
When GSSAPI uses Kerberos, it uses a standard service principal (authentication identity) name in the format servicename/hostname@realm. The principal name used by a particular installation is not encoded in the PostgreSQL server in any way; rather it is specified in the keytab file that the server reads to determine its identity. If multiple principals are listed in the keytab file, the server will accept any one of them. The server's realm name is the preferred realm specified in the Kerberos configuration file(s) accessible to the server.
When connecting, the client must know the principal name of the server it intends to connect to. The servicename part of the principal is ordinarily postgres, but another value can be selected via libpq's krbsrvname connection parameter. The hostname part is the fully qualified host name that libpq is told to connect to. The realm name is the preferred realm specified in the Kerberos configuration file(s) accessible to the client.
The client will also have a principal name for its own identity (and it must have a valid ticket for this principal). To use GSSAPI for authentication, the client principal must be associated with a PostgreSQL database user name. The pg_ident.conf configuration file can be used to map principals to user names; for example, pgusername@realm could be mapped to just pgusername. Alternatively, you can use the full username@realm principal as the role name in PostgreSQL without any mapping.
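As an illustrative sketch (the map name and realm are hypothetical), a pg_ident.conf regular-expression entry that maps all principals of one realm to bare database user names might look like this; the corresponding pg_hba.conf entry would reference it with map=mykrbmap:
# MAPNAME   SYSTEM-USERNAME         PG-USERNAME
mykrbmap    /^(.*)@EXAMPLE\.COM$    \1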
PostgreSQL also supports mapping client principals to user names by just stripping the realm from the principal. This method is supported for backwards compatibility and is strongly discouraged as it is then impossible to distinguish different users with the same user name but coming from different realms. To enable this, set include_realm to 0. For simple single-realm installations, doing that combined with setting the krb_realm parameter (which checks that the principal's realm matches exactly what is in the krb_realm parameter) is still secure; but this is a less capable approach compared to specifying an explicit mapping in pg_ident.conf.
The location of the server's keytab file is specified by the krb_server_keyfile configuration parameter. For security reasons, it is recommended to use a separate keytab just for the PostgreSQL server rather than allowing the server to read the system keytab file. Make sure that your server keytab file is readable (and preferably only readable, not writable) by the PostgreSQL server account. (See also Section 19.1.)
The keytab file is generated using the Kerberos software; see the Kerberos documentation for details. The following example shows doing this using the kadmin tool of MIT-compatible Kerberos 5 implementations:
kadmin% addprinc -randkey postgres/server.my.domain.org
kadmin% ktadd -k krb5.keytab postgres/server.my.domain.org
The following authentication options are supported for the GSSAPI authentication method:
include_realm
If set to 0, the realm name from the authenticated user principal is stripped off before being passed through the user name mapping (Section 21.2). This is discouraged and is primarily available for backwards compatibility, as it is not secure in multi-realm environments unless krb_realm is also used. It is recommended to leave include_realm set to the default (1) and to provide an explicit mapping in pg_ident.conf to convert principal names to PostgreSQL user names.
map
Allows mapping from client principals to database user names. See Section 21.2 for details. For a GSSAPI/Kerberos principal, such as username@EXAMPLE.COM (or, less commonly, username/hostbased@EXAMPLE.COM), the user name used for mapping is username@EXAMPLE.COM (or username/hostbased@EXAMPLE.COM, respectively), unless include_realm has been set to 0, in which case username (or username/hostbased) is what is seen as the system user name when mapping.
krb_realm
Sets the realm to match user principal names against. If this parameter is set, only users of that realm will be accepted. If it is not set, users of any realm can connect, subject to whatever user name mapping is done.
In addition to these settings, which can be different for different pg_hba.conf entries, there is the server-wide krb_caseins_users configuration parameter. If that is set to true, client principals are matched to user map entries case-insensitively. krb_realm, if set, is also matched case-insensitively.
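Putting these options together, a hypothetical pg_hba.conf entry that requires GSSAPI-encrypted connections and applies the user name map sketched earlier might look like:
hostgssenc  all  all  192.168.0.0/16  gss  include_realm=1 map=mykrbmap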
SSPI is a Windows technology for secure authentication with single sign-on. PostgreSQL will use SSPI in negotiate mode, which will use Kerberos when possible and automatically fall back to NTLM in other cases. SSPI and GSSAPI interoperate as clients and servers, e.g., an SSPI client can authenticate to a GSSAPI server. It is recommended to use SSPI on Windows clients and servers and GSSAPI on non-Windows platforms.
When using Kerberos authentication, SSPI works the same way GSSAPI does; see Section 21.6 for details.
The following configuration options are supported for SSPI:
include_realm
If set to 0, the realm name from the authenticated user principal is stripped off before being passed through the user name mapping (Section 21.2). This is discouraged and is primarily available for backwards compatibility, as it is not secure in multi-realm environments unless krb_realm is also used. It is recommended to leave include_realm set to the default (1) and to provide an explicit mapping in pg_ident.conf to convert principal names to PostgreSQL user names.
compat_realm
If set to 1, the domain's SAM-compatible name (also known as the
NetBIOS name) is used for the include_realm
option. This is the default. If set to 0, the true realm name from
the Kerberos user principal name is used.
Do not disable this option unless your server runs under a domain account (this includes virtual service accounts on a domain member system) and all clients authenticating through SSPI are also using domain accounts, or authentication will fail.
upn_username
If this option is enabled along with compat_realm, the user name from the Kerberos UPN is used for authentication. If it is disabled (the default), the SAM-compatible user name is used. By default, these two names are identical for new user accounts.
Note that libpq uses the SAM-compatible name if no explicit user name is specified. If you use libpq or a driver based on it, you should leave this option disabled or explicitly specify the user name in the connection string.
map
Allows for mapping between system and database user names. See Section 21.2 for details. For an SSPI/Kerberos principal, such as username@EXAMPLE.COM (or, less commonly, username/hostbased@EXAMPLE.COM), the user name used for mapping is username@EXAMPLE.COM (or username/hostbased@EXAMPLE.COM, respectively), unless include_realm has been set to 0, in which case username (or username/hostbased) is what is seen as the system user name when mapping.
krb_realm
Sets the realm to match user principal names against. If this parameter is set, only users of that realm will be accepted. If it is not set, users of any realm can connect, subject to whatever user name mapping is done.
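For a simple single-realm Windows environment, the combination described above (include_realm set to 0 together with krb_realm) could be written as the following hypothetical pg_hba.conf entry; the address range and realm are illustrative:
host  all  all  192.168.0.0/16  sspi  include_realm=0 krb_realm=EXAMPLE.COM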
The ident authentication method works by obtaining the client's operating system user name from an ident server and using it as the allowed database user name (with an optional user name mapping). This is only supported on TCP/IP connections.
When ident is specified for a local (non-TCP/IP) connection, peer authentication (see Section 21.9) will be used instead.
The following configuration options are supported for ident:
map
Allows for mapping between system and database user names. See Section 21.2 for details.
The “Identification Protocol” is described in RFC 1413. Virtually every Unix-like operating system ships with an ident server that listens on TCP port 113 by default. The basic functionality of an ident server is to answer questions like “What user initiated the connection that goes out of your port X and connects to my port Y?”. Since PostgreSQL knows both X and Y when a physical connection is established, it can interrogate the ident server on the host of the connecting client and can theoretically determine the operating system user for any given connection.
The drawback of this procedure is that it depends on the integrity of the client: if the client machine is untrusted or compromised, an attacker could run just about any program on port 113 and return any user name they choose. This authentication method is therefore only appropriate for closed networks where each client machine is under tight control and where the database and system administrators operate in close contact. In other words, you must trust the machine running the ident server. Heed the warning:
The Identification Protocol is not intended as an authorization or access control protocol.
--RFC 1413
Some ident servers have a nonstandard option that causes the returned user name to be encrypted, using a key that only the originating machine's administrator knows. This option must not be used when using the ident server with PostgreSQL, since PostgreSQL does not have any way to decrypt the returned string to determine the actual user name.
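On a tightly controlled network, a hypothetical pg_hba.conf entry using ident with a user name map might look like this (the address range and map name are illustrative):
host  all  all  192.168.10.0/24  ident  map=trusted-hosts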
The peer authentication method works by obtaining the client's operating system user name from the kernel and using it as the allowed database user name (with optional user name mapping). This method is only supported on local connections.
The following configuration options are supported for peer:
map
Allows for mapping between system and database user names. See Section 21.2 for details.
Peer authentication is only available on operating systems providing the getpeereid() function, the SO_PEERCRED socket parameter, or similar mechanisms. Currently that includes Linux, most flavors of BSD including macOS, and Solaris.
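A minimal pg_hba.conf entry for peer authentication on local connections looks like:
local  all  all  peer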
This authentication method operates similarly to
password
except that it uses LDAP
as the password verification method. LDAP is used only to validate
the user name/password pairs. Therefore the user must already
exist in the database before LDAP can be used for
authentication.
LDAP authentication can operate in two modes. In the first mode, which we will call the simple bind mode, the server will bind to the distinguished name constructed as prefix username suffix. Typically, the prefix parameter is used to specify cn=, or DOMAIN\ in an Active Directory environment. suffix is used to specify the remaining part of the DN in a non-Active Directory environment.
In the second mode, which we will call the search+bind mode, the server first binds to the LDAP directory with a fixed user name and password, specified with ldapbinddn and ldapbindpasswd, and performs a search for the user trying to log in to the database. If no user and password is configured, an anonymous bind will be attempted to the directory. The search will be performed over the subtree at ldapbasedn, and will try to do an exact match of the attribute specified in ldapsearchattribute. Once the user has been found in this search, the server disconnects and re-binds to the directory as this user, using the password specified by the client, to verify that the login is correct. This mode is the same as that used by LDAP authentication schemes in other software, such as Apache mod_authnz_ldap and pam_ldap. This method allows for significantly more flexibility in where the user objects are located in the directory, but will cause two separate connections to the LDAP server to be made.
The following configuration options are used in both modes:
ldapserver
Names or IP addresses of LDAP servers to connect to. Multiple servers may be specified, separated by spaces.
ldapport
Port number on LDAP server to connect to. If no port is specified, the LDAP library's default port setting will be used.
ldapscheme
Set to ldaps to use LDAPS. This is a non-standard way of using LDAP over SSL, supported by some LDAP server implementations. See also the ldaptls option for an alternative.
ldaptls
Set to 1 to make the connection between PostgreSQL and the LDAP server use TLS encryption. This uses the StartTLS operation per RFC 4513. See also the ldapscheme option for an alternative.
Note that using ldapscheme or ldaptls only encrypts the traffic between the PostgreSQL server and the LDAP server. The connection between the PostgreSQL server and the PostgreSQL client will still be unencrypted unless SSL is used there as well.
The following options are used in simple bind mode only:
ldapprefix
String to prepend to the user name when forming the DN to bind as, when doing simple bind authentication.
ldapsuffix
String to append to the user name when forming the DN to bind as, when doing simple bind authentication.
The following options are used in search+bind mode only:
ldapbasedn
Root DN to begin the search for the user in, when doing search+bind authentication.
ldapbinddn
DN of user to bind to the directory with to perform the search when doing search+bind authentication.
ldapbindpasswd
Password for user to bind to the directory with to perform the search when doing search+bind authentication.
ldapsearchattribute
Attribute to match against the user name in the search when doing search+bind authentication. If no attribute is specified, the uid attribute will be used.
ldapsearchfilter
The search filter to use when doing search+bind authentication. Occurrences of $username will be replaced with the user name. This allows for more flexible search filters than ldapsearchattribute.
ldapurl
An RFC 4516 LDAP URL. This is an alternative way to write some of the other LDAP options in a more compact and standard form. The format is
ldap[s]://host[:port]/basedn[?[attribute][?[scope][?[filter]]]]
scope must be one of base, one, sub, typically the last. (The default is base, which is normally not useful in this application.) attribute can nominate a single attribute, in which case it is used as a value for ldapsearchattribute. If attribute is empty then filter can be used as a value for ldapsearchfilter.
The URL scheme ldaps chooses the LDAPS method for making LDAP connections over SSL, equivalent to using ldapscheme=ldaps. To use encrypted LDAP connections using the StartTLS operation, use the normal URL scheme ldap and specify the ldaptls option in addition to ldapurl.
For non-anonymous binds, ldapbinddn and ldapbindpasswd must be specified as separate options.
LDAP URLs are currently only supported with OpenLDAP, not on Windows.
It is an error to mix configuration options for simple bind with options for search+bind.
When using search+bind mode, the search can be performed using a single attribute specified with ldapsearchattribute, or using a custom search filter specified with ldapsearchfilter. Specifying ldapsearchattribute=foo is equivalent to specifying ldapsearchfilter="(foo=$username)". If neither option is specified the default is ldapsearchattribute=uid.
If PostgreSQL was compiled with OpenLDAP as the LDAP client library, the ldapserver setting may be omitted. In that case, a list of host names and ports is looked up via RFC 2782 DNS SRV records. The name _ldap._tcp.DOMAIN is looked up, where DOMAIN is extracted from ldapbasedn.
Here is an example for a simple-bind LDAP configuration:
host ... ldap ldapserver=ldap.example.net ldapprefix="cn=" ldapsuffix=", dc=example, dc=net"
When a connection to the database server as database user someuser is requested, PostgreSQL will attempt to bind to the LDAP server using the DN cn=someuser, dc=example, dc=net and the password provided by the client. If that connection succeeds, the database access is granted.
Here is an example for a search+bind configuration:
host ... ldap ldapserver=ldap.example.net ldapbasedn="dc=example, dc=net" ldapsearchattribute=uid
When a connection to the database server as database user someuser is requested, PostgreSQL will attempt to bind anonymously (since ldapbinddn was not specified) to the LDAP server, then perform a search for (uid=someuser) under the specified base DN. If an entry is found, it will then attempt to bind using that found information and the password supplied by the client. If that second connection succeeds, the database access is granted.
Here is the same search+bind configuration written as a URL:
host ... ldap ldapurl="ldap://ldap.example.net/dc=example,dc=net?uid?sub"
Some other software that supports authentication against LDAP uses the same URL format, so it will be easier to share the configuration.
Here is an example for a search+bind configuration that uses ldapsearchfilter instead of ldapsearchattribute to allow authentication by user ID or email address:
host ... ldap ldapserver=ldap.example.net ldapbasedn="dc=example, dc=net" ldapsearchfilter="(|(uid=$username)(mail=$username))"
Here is an example for a search+bind configuration that uses DNS SRV discovery to find the host name(s) and port(s) for the LDAP service for the domain name example.net:
host ... ldap ldapbasedn="dc=example,dc=net"
Since LDAP often uses commas and spaces to separate the different parts of a DN, it is often necessary to use double-quoted parameter values when configuring LDAP options, as shown in the examples.
This authentication method operates similarly to
password
except that it uses RADIUS
as the password verification method. RADIUS is used only to validate
the user name/password pairs. Therefore the user must already
exist in the database before RADIUS can be used for
authentication.
When using RADIUS authentication, an Access Request message will be sent to the configured RADIUS server. This request will be of type Authenticate Only, and include parameters for user name, password (encrypted) and NAS Identifier. The request will be encrypted using a secret shared with the server. The RADIUS server will respond to this request with either Access Accept or Access Reject. There is no support for RADIUS accounting.
Multiple RADIUS servers can be specified, in which case they will be tried sequentially. If a negative response is received from a server, the authentication will fail. If no response is received, the next server in the list will be tried. To specify multiple servers, separate the server names with commas and surround the list with double quotes. If multiple servers are specified, the other RADIUS options can also be given as comma-separated lists, to provide individual values for each server. They can also be specified as a single value, in which case that value will apply to all servers.
The following configuration options are supported for RADIUS:
radiusservers
The DNS names or IP addresses of the RADIUS servers to connect to. This parameter is required.
radiussecrets
The shared secrets used when talking securely to the RADIUS servers. This must have exactly the same value on the PostgreSQL and RADIUS servers. It is recommended that this be a string of at least 16 characters. This parameter is required.
The encryption vector used will only be cryptographically strong if PostgreSQL is built with support for OpenSSL. In other cases, the transmission to the RADIUS server should only be considered obfuscated, not secured, and external security measures should be applied if necessary.
radiusports
The port numbers to connect to on the RADIUS servers. If no port is specified, the default RADIUS port (1812) will be used.
radiusidentifiers
The strings to be used as NAS Identifier in the RADIUS requests. This parameter can be used, for example, to identify which database cluster the user is attempting to connect to, which can be useful for policy matching on the RADIUS server. If no identifier is specified, the default postgresql will be used.
If it is necessary to have a comma or whitespace in a RADIUS parameter value, that can be done by putting double quotes around the value, but it is tedious because two layers of double-quoting are now required. An example of putting whitespace into RADIUS secret strings is:
host ... radius radiusservers="server1,server2" radiussecrets="""secret one"",""secret two"""
This authentication method uses SSL client certificates to perform
authentication. It is therefore only available for SSL connections;
see Section 19.9.2 for SSL configuration instructions.
When using this authentication method, the server will require that the client provide a valid, trusted certificate. No password prompt will be sent to the client. The cn (Common Name) attribute of the certificate will be compared to the requested database user name, and if they match the login will be allowed. User name mapping can be used to allow cn to be different from the database user name.
The following configuration options are supported for SSL certificate authentication:
map
Allows for mapping between system and database user names. See Section 21.2 for details.
It is redundant to use the clientcert option with cert authentication because cert authentication is effectively trust authentication with clientcert=verify-full.
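A hypothetical pg_hba.conf entry for certificate authentication with a user name map (the map name is illustrative) might look like:
hostssl  all  all  0.0.0.0/0  cert  map=cert-map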
This authentication method operates similarly to password except that it uses PAM (Pluggable Authentication Modules) as the authentication mechanism. The default PAM service name is postgresql. PAM is used only to validate user name/password pairs and optionally the connected remote host name or IP address. Therefore the user must already exist in the database before PAM can be used for authentication. For more information about PAM, please read the Linux-PAM Page.
The following configuration options are supported for PAM:
pamservice
PAM service name.
pam_use_hostname
Determines whether the remote IP address or the host name is provided to PAM modules through the PAM_RHOST item. By default, the IP address is used. Set this option to 1 to use the resolved host name instead. Host name resolution can lead to login delays. (Most PAM configurations don't use this information, so it is only necessary to consider this setting if a PAM configuration was specifically created to make use of it.)
If PAM is set up to read /etc/shadow, authentication will fail because the PostgreSQL server is started by a non-root user. However, this is not an issue when PAM is configured to use LDAP or other authentication methods.
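A hypothetical pg_hba.conf entry using PAM, naming the PAM service explicitly (the address is illustrative), might look like:
host  all  all  127.0.0.1/32  pam  pamservice=postgresql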
This authentication method operates similarly to
password
except that it uses BSD Authentication
to verify the password. BSD Authentication is used only
to validate user name/password pairs. Therefore the user's role must
already exist in the database before BSD Authentication can be used
for authentication. The BSD Authentication framework is currently
only available on OpenBSD.
BSD Authentication in PostgreSQL uses the auth-postgresql login type and authenticates with the postgresql login class if that's defined in login.conf. By default that login class does not exist, and PostgreSQL will use the default login class.
To use BSD Authentication, the PostgreSQL user account (that is, the operating system user running the server) must first be added to the auth group. The auth group exists by default on OpenBSD systems.
Authentication failures and related problems generally manifest themselves through error messages like the following:
FATAL: no pg_hba.conf entry for host "123.123.123.123", user "andym", database "testdb"
This is what you are most likely to get if you succeed in contacting
the server, but it does not want to talk to you. As the message
suggests, the server refused the connection request because it found
no matching entry in its pg_hba.conf
configuration file.
FATAL: password authentication failed for user "andym"
Messages like this indicate that you contacted the server, and it is
willing to talk to you, but not until you pass the authorization
method specified in the pg_hba.conf
file. Check
the password you are providing, or check your Kerberos or ident
software if the complaint mentions one of those authentication
types.
FATAL: user "andym" does not exist
The indicated database user name was not found.
FATAL: database "testdb" does not exist
The database you are trying to connect to does not exist. Note that if you do not specify a database name, it defaults to the database user name, which might or might not be the right thing.
The server log might contain more information about an authentication failure than is reported to the client. If you are confused about the reason for a failure, check the server log.
PostgreSQL manages database access permissions using the concept of roles. A role can be thought of as either a database user, or a group of database users, depending on how the role is set up. Roles can own database objects (for example, tables and functions) and can assign privileges on those objects to other roles to control who has access to which objects. Furthermore, it is possible to grant membership in a role to another role, thus allowing the member role to use privileges assigned to another role.
The concept of roles subsumes the concepts of “users” and “groups”. In PostgreSQL versions before 8.1, users and groups were distinct kinds of entities, but now there are only roles. Any role can act as a user, a group, or both.
This chapter describes how to create and manage roles. More information about the effects of role privileges on various database objects can be found in Section 5.7.
Database roles are conceptually completely separate from operating system users. In practice it might be convenient to maintain a correspondence, but this is not required. Database roles are global across a database cluster installation (and not per individual database). To create a role use the CREATE ROLE SQL command:
CREATE ROLE name;
name follows the rules for SQL identifiers: either unadorned without special characters, or double-quoted. (In practice, you will usually want to add additional options, such as LOGIN, to the command. More details appear below.) To remove an existing role, use the analogous DROP ROLE command:
DROP ROLE name;
For convenience, the programs createuser and dropuser are provided as wrappers around these SQL commands that can be called from the shell command line:
createuser name
dropuser name
To determine the set of existing roles, examine the pg_roles system catalog, for example
SELECT rolname FROM pg_roles;
The psql program's \du meta-command is also useful for listing the existing roles.
In order to bootstrap the database system, a freshly initialized system always contains one predefined role. This role is always a “superuser”, and by default (unless altered when running initdb) it will have the same name as the operating system user that initialized the database cluster. Customarily, this role will be named postgres. In order to create more roles you first have to connect as this initial role.
Every connection to the database server is made using the name of some particular role, and this role determines the initial access privileges for commands issued in that connection. The role name to use for a particular database connection is indicated by the client that is initiating the connection request in an application-specific fashion. For example, the psql program uses the -U command line option to indicate the role to connect as. Many applications assume the name of the current operating system user by default (including createuser and psql). Therefore it is often convenient to maintain a naming correspondence between roles and operating system users.
The set of database roles a given client connection can connect as is determined by the client authentication setup, as explained in Chapter 21. (Thus, a client is not limited to connect as the role matching its operating system user, just as a person's login name need not match his or her real name.) Since the role identity determines the set of privileges available to a connected client, it is important to carefully configure privileges when setting up a multiuser environment.
A database role can have a number of attributes that define its privileges and interact with the client authentication system.
Only roles that have the LOGIN attribute can be used as the initial role name for a database connection. A role with the LOGIN attribute can be considered the same as a “database user”. To create a role with login privilege, use either:
CREATE ROLE name LOGIN;
CREATE USER name;
(CREATE USER is equivalent to CREATE ROLE except that CREATE USER includes LOGIN by default, while CREATE ROLE does not.)
A database superuser bypasses all permission checks, except the right to log in. This is a dangerous privilege and should not be used carelessly; it is best to do most of your work as a role that is not a superuser. To create a new database superuser, use CREATE ROLE name SUPERUSER. You must do this as a role that is already a superuser.
A role must be explicitly given permission to create databases (except for superusers, since those bypass all permission checks). To create such a role, use CREATE ROLE name CREATEDB.
A role must be explicitly given permission to create more roles (except for superusers, since those bypass all permission checks). To create such a role, use CREATE ROLE name CREATEROLE. A role with CREATEROLE privilege can alter and drop other roles, too, as well as grant or revoke membership in them.
Altering a role includes most changes that can be made using
ALTER ROLE
, including, for example, changing
passwords. It also includes modifications to a role that can
be made using the COMMENT
and
SECURITY LABEL
commands.
However, CREATEROLE
does not convey the ability to
create SUPERUSER
roles, nor does it convey any
power over SUPERUSER
roles that already exist.
Furthermore, CREATEROLE
does not convey the power
to create REPLICATION
users, nor the ability to
grant or revoke the REPLICATION
privilege, nor the
ability to modify the role properties of such users. However, it does allow ALTER ROLE ... SET and ALTER ROLE ... RENAME to be used on REPLICATION roles, as well as the use of COMMENT ON ROLE, SECURITY LABEL ON ROLE, and DROP ROLE.
Finally, CREATEROLE
does not
confer the ability to grant or revoke the BYPASSRLS
privilege.
Because the CREATEROLE privilege allows a user to grant or revoke membership even in roles to which it does not (yet) have any access, a CREATEROLE user can obtain access to the capabilities of every predefined role in the system, including highly privileged roles such as pg_execute_server_program and pg_write_server_files.
A role must explicitly be given permission to initiate streaming replication (except for superusers, since those bypass all permission checks). A role used for streaming replication must have LOGIN permission as well. To create such a role, use CREATE ROLE name REPLICATION LOGIN.
A password is only significant if the client authentication method requires the user to supply a password when connecting to the database. The password and md5 authentication methods make use of passwords. Database passwords are separate from operating system passwords. Specify a password upon role creation with CREATE ROLE name PASSWORD 'string'.
A role is given permission to inherit the privileges of roles it is a member of, by default. However, to create a role without the permission, use CREATE ROLE name NOINHERIT.
A role must be explicitly given permission to bypass every row-level security (RLS) policy (except for superusers, since those bypass all permission checks). To create such a role, use CREATE ROLE name BYPASSRLS as a superuser.
A connection limit can specify how many concurrent connections a role can make. -1 (the default) means no limit. Specify the connection limit upon role creation with CREATE ROLE name CONNECTION LIMIT 'integer'.
A role's attributes can be modified after creation with ALTER ROLE. See the reference pages for the CREATE ROLE and ALTER ROLE commands for details.
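As an illustrative sketch (the role name and settings are made up), several of these attributes can be combined at creation time and adjusted later:
CREATE ROLE app_admin LOGIN CREATEDB CREATEROLE CONNECTION LIMIT 10 PASSWORD 'secret';
ALTER ROLE app_admin NOCREATEROLE;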
A role can also have role-specific defaults for many of the run-time configuration settings described in Chapter 20. For example, if for some reason you want to disable index scans (hint: not a good idea) anytime you connect, you can use:
ALTER ROLE myname SET enable_indexscan TO off;
This will save the setting (but not set it immediately). In
subsequent connections by this role it will appear as though
SET enable_indexscan TO off
had been executed
just before the session started.
You can still alter this setting during the session; it will only be the default. To remove a role-specific default setting, use ALTER ROLE rolename RESET varname. Note that role-specific defaults attached to roles without LOGIN privilege are fairly useless, since they will never be invoked.
It is frequently convenient to group users together to ease management of privileges: that way, privileges can be granted to, or revoked from, a group as a whole. In PostgreSQL this is done by creating a role that represents the group, and then granting membership in the group role to individual user roles.
To set up a group role, first create the role:
CREATE ROLE name;
Typically a role being used as a group would not have the LOGIN attribute, though you can set it if you wish.
Once the group role exists, you can add and remove members using the GRANT and REVOKE commands:
GRANT group_role TO role1, ... ;
REVOKE group_role FROM role1, ... ;
You can grant membership to other group roles, too (since there isn't really any distinction between group roles and non-group roles). The database will not let you set up circular membership loops. Also, it is not permitted to grant membership in a role to PUBLIC.
The members of a group role can use the privileges of the role in two
ways. First, every member of a group can explicitly do
SET ROLE
to
temporarily “become” the group role. In this state, the
database session has access to the privileges of the group role rather
than the original login role, and any database objects created are
considered owned by the group role not the login role. Second, member
roles that have the INHERIT
attribute automatically have use
of the privileges of roles of which they are members, including any
privileges inherited by those roles.
As an example, suppose we have done:
CREATE ROLE joe LOGIN INHERIT;
CREATE ROLE admin NOINHERIT;
CREATE ROLE wheel NOINHERIT;
GRANT admin TO joe;
GRANT wheel TO admin;
Immediately after connecting as role joe, a database session will have use of privileges granted directly to joe plus any privileges granted to admin, because joe “inherits” admin's privileges. However, privileges granted to wheel are not available, because even though joe is indirectly a member of wheel, the membership is via admin which has the NOINHERIT attribute. After:
SET ROLE admin;
the session would have use of only those privileges granted to admin, and not those granted to joe. After:
SET ROLE wheel;
the session would have use of only those privileges granted to wheel, and not those granted to either joe or admin. The original privilege state can be restored with any of:
SET ROLE joe;
SET ROLE NONE;
RESET ROLE;
The SET ROLE command always allows selecting any role that the original login role is directly or indirectly a member of. Thus, in the above example, it is not necessary to become admin before becoming wheel.
In the SQL standard, there is a clear distinction between users and roles,
and users do not automatically inherit privileges while roles do. This
behavior can be obtained in PostgreSQL by giving
roles being used as SQL roles the INHERIT
attribute, while
giving roles being used as SQL users the NOINHERIT
attribute.
However, PostgreSQL defaults to giving all roles
the INHERIT
attribute, for backward compatibility with pre-8.1
releases in which users always had use of permissions granted to groups
they were members of.
The role attributes LOGIN, SUPERUSER, CREATEDB, and CREATEROLE can be thought of as special privileges, but they are never inherited as ordinary privileges on database objects are. You must actually SET ROLE to a specific role having one of these attributes in order to make use of the attribute. Continuing the above example, we might choose to grant CREATEDB and CREATEROLE to the admin role. Then a session connecting as role joe would not have these privileges immediately, only after doing SET ROLE admin.
To destroy a group role, use DROP ROLE:
DROP ROLE name;
Any memberships in the group role are automatically revoked (but the member roles are not otherwise affected).
Because roles can own database objects and can hold privileges
to access other objects, dropping a role is often not just a matter of a
quick DROP ROLE
. Any objects owned by the role must
first be dropped or reassigned to other owners; and any permissions
granted to the role must be revoked.
Ownership of objects can be transferred one at a time
using ALTER
commands, for example:
ALTER TABLE bobs_table OWNER TO alice;
Alternatively, the REASSIGN OWNED
command can be
used to reassign ownership of all objects owned by the role-to-be-dropped
to a single other role. Because REASSIGN OWNED
cannot access
objects in other databases, it is necessary to run it in each database
that contains objects owned by the role. (Note that the first
such REASSIGN OWNED
will change the ownership of any
shared-across-databases objects, that is databases or tablespaces, that
are owned by the role-to-be-dropped.)
Once any valuable objects have been transferred to new owners, any
remaining objects owned by the role-to-be-dropped can be dropped with
the DROP OWNED
command. Again, this command cannot
access objects in other databases, so it is necessary to run it in each
database that contains objects owned by the role. Also, DROP
OWNED
will not drop entire databases or tablespaces, so it is
necessary to do that manually if the role owns any databases or
tablespaces that have not been transferred to new owners.
DROP OWNED
also takes care of removing any privileges granted
to the target role for objects that do not belong to it.
Because REASSIGN OWNED
does not touch such objects, it's
typically necessary to run both REASSIGN OWNED
and DROP OWNED
(in that order!) to fully remove the
dependencies of a role to be dropped.
In short then, the most general recipe for removing a role that has been used to own objects is:
REASSIGN OWNED BY doomed_role TO successor_role;
DROP OWNED BY doomed_role;
-- repeat the above commands in each database of the cluster
DROP ROLE doomed_role;
When not all owned objects are to be transferred to the same successor owner, it's best to handle the exceptions manually and then perform the above steps to mop up.
If DROP ROLE
is attempted while dependent objects still
remain, it will issue messages identifying which objects need to be
reassigned or dropped.
PostgreSQL provides a set of predefined roles
that provide access to certain, commonly needed, privileged capabilities
and information. Administrators (including roles that have the
CREATEROLE
privilege) can GRANT
these
roles to users and/or other roles in their environment, providing those
users with access to the specified capabilities and information.
The predefined roles are described in Table 22.1. Note that the specific permissions for each of the roles may change in the future as additional capabilities are added. Administrators should monitor the release notes for changes.
Table 22.1. Predefined Roles
Role | Allowed Access
---|---
pg_read_all_data | Read all data (tables, views, sequences), as if having SELECT rights on those objects, and USAGE rights on all schemas, even without having it explicitly. This role does not have the role attribute BYPASSRLS set. If RLS is being used, an administrator may wish to set BYPASSRLS on roles which this role is GRANTed to.
pg_write_all_data | Write all data (tables, views, sequences), as if having INSERT, UPDATE, and DELETE rights on those objects, and USAGE rights on all schemas, even without having it explicitly. This role does not have the role attribute BYPASSRLS set. If RLS is being used, an administrator may wish to set BYPASSRLS on roles which this role is GRANTed to.
pg_read_all_settings | Read all configuration variables, even those normally visible only to superusers.
pg_read_all_stats | Read all pg_stat_* views and use various statistics related extensions, even those normally visible only to superusers.
pg_stat_scan_tables | Execute monitoring functions that may take ACCESS SHARE locks on tables, potentially for a long time.
pg_monitor | Read/execute various monitoring views and functions. This role is a member of pg_read_all_settings, pg_read_all_stats and pg_stat_scan_tables.
pg_database_owner | None. Membership consists, implicitly, of the current database owner.
pg_signal_backend | Signal another backend to cancel a query or terminate its session.
pg_read_server_files | Allow reading files from any location the database can access on the server with COPY and other file-access functions.
pg_write_server_files | Allow writing to files in any location the database can access on the server with COPY and other file-access functions.
pg_execute_server_program | Allow executing programs on the database server as the user the database runs as with COPY and other functions which allow executing a server-side program.
The pg_monitor, pg_read_all_settings, pg_read_all_stats and pg_stat_scan_tables roles are intended to allow administrators to easily configure a role for the purpose of monitoring the database server. They grant a set of common privileges allowing the role to read various useful configuration settings, statistics and other system information normally restricted to superusers.
The pg_database_owner role has one implicit, situation-dependent member, namely the owner of the current database. The role conveys no rights at first. Like any role, it can own objects or receive grants of access privileges. Consequently, once pg_database_owner has rights within a template database, each owner of a database instantiated from that template will exercise those rights. pg_database_owner cannot be a member of any role, and it cannot have non-implicit members.
The pg_signal_backend
role is intended to allow
administrators to enable trusted, but non-superuser, roles to send signals
to other backends. Currently this role enables sending of signals for
canceling a query on another backend or terminating its session. A user
granted this role cannot however send signals to a backend owned by a
superuser. See Section 9.27.2.
The pg_read_server_files
, pg_write_server_files
and
pg_execute_server_program
roles are intended to allow administrators to have
trusted, but non-superuser, roles which are able to access files and run programs on the
database server as the user the database runs as. As these roles are able to access any file on
the server file system, they bypass all database-level permission checks when accessing files
directly and they could be used to gain superuser-level access, therefore
great care should be taken when granting these roles to users.
Care should be taken when granting these roles to ensure they are only used where needed and with the understanding that these roles grant access to privileged information.
Administrators can grant access to these roles to users using the
GRANT
command, for example:
GRANT pg_signal_backend TO admin_user;
Functions, triggers and row-level security policies allow users to insert
code into the backend server that other users might execute
unintentionally. Hence, these mechanisms permit users to “Trojan
horse” others with relative ease. The strongest protection is tight
control over who can define objects. Where that is infeasible, write
queries referring only to objects having trusted owners. Remove
from search_path
the public schema and any other schemas
that permit untrusted users to create objects.
Functions run inside the backend server process with the operating system permissions of the database server daemon. If the programming language used for the function allows unchecked memory accesses, it is possible to change the server's internal data structures. Hence, among many other things, such functions can circumvent any system access controls. Function languages that allow such access are considered “untrusted”, and PostgreSQL allows only superusers to create functions written in those languages.
Every instance of a running PostgreSQL server manages one or more databases. Databases are therefore the topmost hierarchical level for organizing SQL objects (“database objects”). This chapter describes the properties of databases, and how to create, manage, and destroy them.
A small number of objects, like role, database, and tablespace
names, are defined at the cluster level and stored in the
pg_global
tablespace. Inside the cluster are
multiple databases, which are isolated from each other but can access
cluster-level objects. Inside each database are multiple schemas,
which contain objects like tables and functions. So the full hierarchy
is: cluster, database, schema, table (or some other kind of object,
such as a function).
When connecting to the database server, a client must specify the database name in its connection request. It is not possible to access more than one database per connection. However, clients can open multiple connections to the same database, or different databases. Database-level security has two components: access control (see Section 21.1), managed at the connection level, and authorization control (see Section 5.7), managed via the grant system. Foreign data wrappers (see postgres_fdw) allow for objects within one database to act as proxies for objects in other databases or clusters. The older dblink module (see dblink) provides a similar capability. By default, all users can connect to all databases using all connection methods.
If one PostgreSQL server cluster is planned to contain unrelated projects or users that should be, for the most part, unaware of each other, it is recommended to put them into separate databases and adjust authorizations and access controls accordingly. If the projects or users are interrelated, and thus should be able to use each other's resources, they should be put in the same database but probably into separate schemas; this provides a modular structure with namespace isolation and authorization control. More information about managing schemas is in Section 5.9.
While multiple databases can be created within a single cluster, it is advised to consider carefully whether the benefits outweigh the risks and limitations. In particular, the impact that having a shared WAL (see Chapter 30) has on backup and recovery options. While individual databases in the cluster are isolated when considered from the user's perspective, they are closely bound from the database administrator's point-of-view.
Databases are created with the CREATE DATABASE command (see Section 23.2) and destroyed with the DROP DATABASE command (see Section 23.5). To determine the set of existing databases, examine the pg_database system catalog, for example
SELECT datname FROM pg_database;
The psql program's \l meta-command and -l command-line option are also useful for listing the existing databases.
The SQL standard calls databases “catalogs”, but there is no difference in practice.
In order to create a database, the PostgreSQL server must be up and running (see Section 19.3).
Databases are created with the SQL command CREATE DATABASE:
CREATE DATABASE name;
where name follows the usual rules for SQL identifiers. The current role automatically becomes the owner of the new database. It is the privilege of the owner of a database to remove it later (which also removes all the objects in it, even if they have a different owner).
The creation of databases is a restricted operation. See Section 22.2 for how to grant permission.
Since you need to be connected to the database server in order to execute the CREATE DATABASE command, the question remains how the first database at any given site can be created. The first database is always created by the initdb command when the data storage area is initialized. (See Section 19.2.) This database is called postgres. So to create the first “ordinary” database you can connect to postgres.
A second database, template1, is also created during database cluster initialization. Whenever a new database is created within the cluster, template1 is essentially cloned. This means that any changes you make in template1 are propagated to all subsequently created databases. Because of this, avoid creating objects in template1 unless you want them propagated to every newly created database. More details appear in Section 23.3.
As a convenience, there is a program you can execute from the shell to create new databases, createdb.
createdb dbname
createdb does no magic. It connects to the postgres database and issues the CREATE DATABASE command, exactly as described above. The createdb reference page contains the invocation details. Note that createdb without any arguments will create a database with the current user name.
Chapter 21 contains information about how to restrict who can connect to a given database.
Sometimes you want to create a database for someone else, and have them become the owner of the new database, so they can configure and manage it themselves. To achieve that, use one of the following commands:
CREATE DATABASE dbname OWNER rolename;
from the SQL environment, or:
createdb -O rolename dbname
from the shell. Only the superuser is allowed to create a database for someone else (that is, for a role you are not a member of).
CREATE DATABASE
actually works by copying an existing
database. By default, it copies the standard system database named
template1
. Thus that
database is the “template” from which new databases are
made. If you add objects to template1
, these objects
will be copied into subsequently created user databases. This
behavior allows site-local modifications to the standard set of
objects in databases. For example, if you install the procedural
language PL/Perl in template1
, it will
automatically be available in user databases without any extra
action being taken when those databases are created.
However, CREATE DATABASE
does not copy database-level
GRANT
permissions attached to the source database.
The new database has default database-level permissions.
There is a second standard system database named
template0
. This
database contains the same data as the initial contents of
template1
, that is, only the standard objects
predefined by your version of
PostgreSQL. template0
should never be changed after the database cluster has been
initialized. By instructing
CREATE DATABASE
to copy template0
instead
of template1
, you can create a “pristine” user
database (one where no user-defined objects exist and where the system
objects have not been altered) that contains none of the site-local additions in
template1
. This is particularly handy when restoring a
pg_dump
dump: the dump script should be restored in a
pristine database to ensure that one recreates the correct contents
of the dumped database, without conflicting with objects that
might have been added to template1
later on.
Another common reason for copying template0
instead
of template1
is that new encoding and locale settings
can be specified when copying template0
, whereas a copy
of template1
must use the same settings it does.
This is because template1
might contain encoding-specific
or locale-specific data, while template0
is known not to.
To create a database by copying template0, use:
CREATE DATABASE dbname TEMPLATE template0;
from the SQL environment, or:
createdb -T template0 dbname
from the shell.
It is possible to create additional template databases, and indeed
one can copy any database in a cluster by specifying its name
as the template for CREATE DATABASE
. It is important to
understand, however, that this is not (yet) intended as
a general-purpose “COPY DATABASE
” facility.
The principal limitation is that no other sessions can be connected to
the source database while it is being copied. CREATE
DATABASE
will fail if any other connection exists when it starts;
during the copy operation, new connections to the source database
are prevented.
Two useful flags exist in pg_database for each database: the columns datistemplate and datallowconn. datistemplate can be set to indicate that a database is intended as a template for CREATE DATABASE. If this flag is set, the database can be cloned by any user with CREATEDB privileges; if it is not set, only superusers and the owner of the database can clone it. If datallowconn is false, then no new connections to that database will be allowed (but existing sessions are not terminated simply by setting the flag false). The template0 database is normally marked datallowconn = false to prevent its modification.
Both template0 and template1 should always be marked with datistemplate = true.
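These flags correspond to the IS_TEMPLATE and ALLOW_CONNECTIONS options of ALTER DATABASE, so they can be adjusted without updating pg_database directly; the database name here is illustrative:
ALTER DATABASE mytemplate IS_TEMPLATE true;
ALTER DATABASE mytemplate ALLOW_CONNECTIONS false;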
template1 and template0 do not have any special status beyond the fact that the name template1 is the default source database name for CREATE DATABASE. For example, one could drop template1 and recreate it from template0 without any ill effects. This course of action might be advisable if one has carelessly added a bunch of junk in template1. (To delete template1, it must have pg_database.datistemplate = false.)
The postgres database is also created when a database cluster is initialized. This database is meant as a default database for users and applications to connect to. It is simply a copy of template1 and can be dropped and recreated if necessary.
Recall from Chapter 20 that the PostgreSQL server provides a large number of run-time configuration variables. You can set database-specific default values for many of these settings.
For example, if for some reason you want to disable the GEQO optimizer for a given database, you'd ordinarily have to either disable it for all databases or make sure that every connecting client is careful to issue SET geqo TO off. To make this setting the default within a particular database, you can execute the command:
ALTER DATABASE mydb SET geqo TO off;
This will save the setting (but not set it immediately). In subsequent connections to this database it will appear as though SET geqo TO off; had been executed just before the session started. Note that users can still alter this setting during their sessions; it will only be the default. To undo any such setting, use ALTER DATABASE dbname RESET varname.
Databases are destroyed with the command DROP DATABASE:
DROP DATABASE name;
Only the owner of the database, or a superuser, can drop a database. Dropping a database removes all objects that were contained within the database. The destruction of a database cannot be undone.
You cannot execute the DROP DATABASE command while connected to the victim database. You can, however, be connected to any other database, including the template1 database. template1 would be the only option for dropping the last user database of a given cluster.
For convenience, there is also a shell program to drop databases, dropdb:
dropdb dbname
(Unlike createdb, it is not the default action to drop the database with the current user name.)
Tablespaces in PostgreSQL allow database administrators to define locations in the file system where the files representing database objects can be stored. Once created, a tablespace can be referred to by name when creating database objects.
By using tablespaces, an administrator can control the disk layout of a PostgreSQL installation. This is useful in at least two ways. First, if the partition or volume on which the cluster was initialized runs out of space and cannot be extended, a tablespace can be created on a different partition and used until the system can be reconfigured.
Second, tablespaces allow an administrator to use knowledge of the usage pattern of database objects to optimize performance. For example, an index which is very heavily used can be placed on a very fast, highly available disk, such as an expensive solid state device. At the same time a table storing archived data which is rarely used or not performance critical could be stored on a less expensive, slower disk system.
Even though located outside the main PostgreSQL data directory, tablespaces are an integral part of the database cluster and cannot be treated as an autonomous collection of data files. They are dependent on metadata contained in the main data directory, and therefore cannot be attached to a different database cluster or backed up individually. Similarly, if you lose a tablespace (file deletion, disk failure, etc), the database cluster might become unreadable or unable to start. Placing a tablespace on a temporary file system like a RAM disk risks the reliability of the entire cluster.
To define a tablespace, use the CREATE TABLESPACE command, for example:
CREATE TABLESPACE fastspace LOCATION '/ssd1/postgresql/data';
The location must be an existing, empty directory that is owned by the PostgreSQL operating system user. All objects subsequently created within the tablespace will be stored in files underneath this directory. The location must not be on removable or transient storage, as the cluster might fail to function if the tablespace is missing or lost.
There is usually not much point in making more than one tablespace per logical file system, since you cannot control the location of individual files within a logical file system. However, PostgreSQL does not enforce any such limitation, and indeed it is not directly aware of the file system boundaries on your system. It just stores files in the directories you tell it to use.
Creation of the tablespace itself must be done as a database superuser, but after that you can allow ordinary database users to use it. To do that, grant them the CREATE privilege on it.
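For example, to grant the privilege on the tablespace created above (the role name here is hypothetical):
GRANT CREATE ON TABLESPACE fastspace TO webuser;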
Tables, indexes, and entire databases can be assigned to particular tablespaces. To do so, a user with the CREATE privilege on a given tablespace must pass the tablespace name as a parameter to the relevant command. For example, the following creates a table in the tablespace space1:
CREATE TABLE foo(i int) TABLESPACE space1;
Alternatively, use the default_tablespace parameter:
SET default_tablespace = space1;
CREATE TABLE foo(i int);
When default_tablespace is set to anything but an empty string, it supplies an implicit TABLESPACE clause for CREATE TABLE and CREATE INDEX commands that do not have an explicit one.
There is also a temp_tablespaces parameter, which determines the placement of temporary tables and indexes, as well as temporary files that are used for purposes such as sorting large data sets. This can be a list of tablespace names, rather than only one, so that the load associated with temporary objects can be spread over multiple tablespaces. A random member of the list is picked each time a temporary object is to be created.
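For example, temporary objects could be spread over the two tablespaces used above (a sketch; both tablespaces are assumed to exist and to be usable by the current role):
SET temp_tablespaces = fastspace, space1;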
The tablespace associated with a database is used to store the system catalogs of that database. Furthermore, it is the default tablespace used for tables, indexes, and temporary files created within the database, if no TABLESPACE clause is given and no other selection is specified by default_tablespace or temp_tablespaces (as appropriate).
If a database is created without specifying a tablespace for it,
it uses the same tablespace as the template database it is copied from.
Two tablespaces are automatically created when the database cluster is initialized. The pg_global tablespace is used for shared system catalogs. The pg_default tablespace is the default tablespace of the template1 and template0 databases (and, therefore, will be the default tablespace for other databases as well, unless overridden by a TABLESPACE clause in CREATE DATABASE).
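For example, a database can be placed in a specific tablespace at creation time (the database name here is hypothetical):
CREATE DATABASE salesdb TABLESPACE fastspace;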
Once created, a tablespace can be used from any database, provided the requesting user has sufficient privilege. This means that a tablespace cannot be dropped until all objects in all databases using the tablespace have been removed.
To remove an empty tablespace, use the DROP TABLESPACE command.
To determine the set of existing tablespaces, examine the pg_tablespace system catalog, for example
SELECT spcname FROM pg_tablespace;
The psql program's \db meta-command is also useful for listing the existing tablespaces.
PostgreSQL makes use of symbolic links to simplify the implementation of tablespaces. This means that tablespaces can be used only on systems that support symbolic links.
The directory $PGDATA/pg_tblspc contains symbolic links that point to each of the non-built-in tablespaces defined in the cluster. Although not recommended, it is possible to adjust the tablespace layout by hand by redefining these links. Under no circumstances perform this operation while the server is running. Note that in PostgreSQL 9.1 and earlier you will also need to update the pg_tablespace catalog with the new locations. (If you do not, pg_dump will continue to output the old tablespace locations.)
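As a read-only alternative to inspecting the symbolic links, the on-disk location of each tablespace can also be queried (a sketch):
SELECT spcname, pg_tablespace_location(oid) FROM pg_tablespace;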
This chapter describes the available localization features from the point of view of the administrator. PostgreSQL supports two localization facilities:
Using the locale features of the operating system to provide locale-specific collation order, number formatting, translated messages, and other aspects. This is covered in Section 24.1 and Section 24.2.
Providing a number of different character sets to support storing text in all kinds of languages, and providing character set translation between client and server. This is covered in Section 24.3.
Locale support refers to an application respecting cultural preferences regarding alphabets, sorting, number formatting, etc. PostgreSQL uses the standard ISO C and POSIX locale facilities provided by the server operating system. For additional information refer to the documentation of your system.
Locale support is automatically initialized when a database cluster is created using initdb. initdb will initialize the database cluster with the locale setting of its execution environment by default, so if your system is already set to use the locale that you want in your database cluster then there is nothing else you need to do. If you want to use a different locale (or you are not sure which locale your system is set to), you can instruct initdb exactly which locale to use by specifying the --locale option. For example:
initdb --locale=sv_SE
This example for Unix systems sets the locale to Swedish (sv) as spoken in Sweden (SE). Other possibilities might include en_US (U.S. English) and fr_CA (French Canadian). If more than one character set can be used for a locale then the specifications can take the form language_territory.codeset. For example, fr_BE.UTF-8 represents the French language (fr) as spoken in Belgium (BE), with a UTF-8 character set encoding.
What locales are available on your system under what names depends on what was provided by the operating system vendor and what was installed. On most Unix systems, the command locale -a will provide a list of available locales. Windows uses more verbose locale names, such as German_Germany or Swedish_Sweden.1252, but the principles are the same.
Occasionally it is useful to mix rules from several locales, e.g., use English collation rules but Spanish messages. To support that, a set of locale subcategories exist that control only certain aspects of the localization rules:
LC_COLLATE | String sort order |
LC_CTYPE | Character classification (What is a letter? Its upper-case equivalent?) |
LC_MESSAGES | Language of messages |
LC_MONETARY | Formatting of currency amounts |
LC_NUMERIC | Formatting of numbers |
LC_TIME | Formatting of dates and times |
The category names translate into names of initdb options to override the locale choice for a specific category. For instance, to set the locale to French Canadian, but use U.S. rules for formatting currency, use initdb --locale=fr_CA --lc-monetary=en_US.
If you want the system to behave as if it had no locale support, use the special locale name C, or equivalently POSIX.
Some locale categories must have their values fixed when the database is created. You can use different settings for different databases, but once a database is created, you cannot change them for that database anymore. LC_COLLATE and LC_CTYPE are these categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns would become corrupt. (But you can alleviate this restriction using collations, as discussed in Section 24.2.)
The default values for these categories are determined when initdb is run, and those values are used when new databases are created, unless specified otherwise in the CREATE DATABASE command.
The other locale categories can be changed whenever desired by setting the server configuration parameters that have the same name as the locale categories (see Section 20.11.2 for details). The values that are chosen by initdb are actually only written into the configuration file postgresql.conf to serve as defaults when the server is started. If you remove these assignments from postgresql.conf then the server will inherit the settings from its execution environment.
Note that the locale behavior of the server is determined by the environment variables seen by the server, not by the environment of any client. Therefore, be careful to configure the correct locale settings before starting the server. A consequence of this is that if client and server are set up in different locales, messages might appear in different languages depending on where they originated.
When we speak of inheriting the locale from the execution environment, this means the following on most operating systems: For a given locale category, say the collation, the following environment variables are consulted in this order until one is found to be set: LC_ALL, LC_COLLATE (or the variable corresponding to the respective category), LANG. If none of these environment variables are set then the locale defaults to C.
Some message localization libraries also look at the environment variable LANGUAGE, which overrides all other locale settings for the purpose of setting the language of messages. If in doubt, please refer to the documentation of your operating system, in particular the documentation about gettext.
To enable messages to be translated to the user's preferred language, NLS must have been selected at build time (configure --enable-nls). All other locale support is built in automatically.
The locale settings influence the following SQL features:
Sort order in queries using ORDER BY or the standard comparison operators on textual data
Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions
The ability to use indexes with LIKE clauses
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.
As a workaround to allow PostgreSQL to use indexes with LIKE clauses under a non-C locale, several custom operator classes exist. These allow the creation of an index that performs a strict character-by-character comparison, ignoring locale comparison rules. Refer to Section 11.10 for more information. Another approach is to create indexes using the C collation, as discussed in Section 24.2.
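As a sketch of these two approaches (the table and column names here are hypothetical; text_pattern_ops is one of the operator classes described in Section 11.10):
CREATE INDEX customers_name_pattern_idx ON customers (name text_pattern_ops);
CREATE INDEX customers_name_c_idx ON customers (name COLLATE "C");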
If locale support doesn't work according to the explanation above, check that the locale support in your operating system is correctly configured. To check what locales are installed on your system, you can use the command locale -a if your operating system provides it.
Check that PostgreSQL is actually using the locale that you think it is. The LC_COLLATE and LC_CTYPE settings are determined when a database is created, and cannot be changed except by creating a new database. Other locale settings including LC_MESSAGES and LC_MONETARY are initially determined by the environment the server is started in, but can be changed on-the-fly. You can check the active locale settings using the SHOW command.
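For example:
SHOW lc_collate;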
The directory src/test/locale in the source distribution contains a test suite for PostgreSQL's locale support.
Client applications that handle server-side errors by parsing the text of the error message will obviously have problems when the server's messages are in a different language. Authors of such applications are advised to make use of the error code scheme instead.
Maintaining catalogs of message translations requires the on-going efforts of many volunteers that want to see PostgreSQL speak their preferred language well. If messages in your language are currently not available or not fully translated, your assistance would be appreciated. If you want to help, refer to Chapter 55 or write to the developers' mailing list.
The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.
Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types are text, varchar, and char. User-defined base types can also be marked collatable, and of course a domain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.
The collation of an expression can be the “default” collation, which means the locale settings defined for the database. It is also possible for an expression's collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.
When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, with ORDER BY clauses and function or operator calls such as <. The collation to apply for an ORDER BY clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such as lower, upper, and initcap; by pattern matching operators; and by to_char and related functions.
For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.
The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:
If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.
Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.
If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.
For example, consider this table definition:
CREATE TABLE test1 ( a text COLLATE "de_DE", b text COLLATE "es_ES", ... );
Then in
SELECT a < 'foo' FROM test1;
the < comparison is performed according to de_DE rules, because the expression combines an implicitly derived collation with the default collation. But in
SELECT a < ('foo' COLLATE "fr_FR") FROM test1;
the comparison is performed using fr_FR rules, because the explicit collation derivation overrides the implicit one.
Furthermore, given
SELECT a < b FROM test1;
the parser cannot determine which collation to apply, since the a and b columns have conflicting implicit collations. Since the < operator does need to know which collation to use, this will result in an error. The error can be resolved by attaching an explicit collation specifier to either input expression, thus:
SELECT a < b COLLATE "de_DE" FROM test1;
or equivalently
SELECT a COLLATE "de_DE" < b FROM test1;
On the other hand, the structurally similar case
SELECT a || b FROM test1;
does not result in an error, because the || operator does not care about collations: its result is the same regardless of the collation.
The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type. So, in
SELECT * FROM test1 ORDER BY a || 'foo';
the ordering will be done according to de_DE rules.
But this query:
SELECT * FROM test1 ORDER BY a || b;
results in an error, because even though the || operator doesn't need to know a collation, the ORDER BY clause does. As before, the conflict can be resolved with an explicit collation specifier:
SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc, which uses the locales provided by the operating system C library. These are the locales that most tools provided by the operating system use. Another provider is icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.
A collation object provided by libc maps to a combination of LC_COLLATE and LC_CTYPE settings, as accepted by the setlocale() system library call. (As the name would suggest, the main purpose of a collation is to set LC_COLLATE, which controls the sort order. But it is rarely necessary in practice to have an LC_CTYPE setting that is different from LC_COLLATE, so it is more convenient to collect these under one concept than to create another infrastructure for setting LC_CTYPE per expression.) Also, a libc collation is tied to a character set encoding (see Section 24.3). The same collation name may exist for different encodings.
A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.
On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.
Additionally, the SQL standard collation name ucs_basic is available for encoding UTF8. It is equivalent to C and sorts by Unicode code point.
If the operating system provides support for using multiple locales within a single program (newlocale and related functions), or if support for ICU is configured, then when a database cluster is initialized, initdb populates the system catalog pg_collation with collations based on all the locales it finds in the operating system at the time.
To inspect the currently available locales, use the query SELECT * FROM pg_collation, or the command \dOS+ in psql.
For example, the operating system might provide a locale named de_DE.utf8. initdb would then create a collation named de_DE.utf8 for encoding UTF8 that has both LC_COLLATE and LC_CTYPE set to de_DE.utf8. It will also create a collation with the .utf8 tag stripped off the name. So you could also use the collation under the name de_DE, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.
The default set of collations provided by libc map directly to the locales installed in the operating system, which can be listed using the command locale -a. In case a libc collation is needed that has different values for LC_COLLATE and LC_CTYPE, or if new locales are installed in the operating system after the database system was initialized, then a new collation may be created using the CREATE COLLATION command. New operating system locales can also be imported en masse using the pg_import_system_collations() function.
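For example (a sketch; this function requires superuser privileges and adds any collations that are missing from the given schema):
SELECT pg_import_system_collations('pg_catalog');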
Within any particular database, only collations that use that database's encoding are of interest. Other entries in pg_collation are ignored. Thus, a stripped collation name such as de_DE can be considered unique within a given database even though it would not be unique globally.
Use of the stripped collation names is recommended, since it will make one fewer thing you need to change if you decide to change to another database encoding. Note however that the default, C, and POSIX collations can be used regardless of the database encoding.
PostgreSQL considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,
SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;
will draw an error even though the C and POSIX collations have identical behaviors. Mixing stripped and non-stripped collation names is therefore not recommended.
With ICU, it is not sensible to enumerate all possible locale names. ICU uses a particular naming system for locales, but there are many more ways to name a locale than there are actually distinct locales. initdb uses the ICU APIs to extract a set of distinct locales to populate the initial set of collations. Collations provided by ICU are created in the SQL environment with names in BCP 47 language tag format, with a “private use” extension -x-icu appended, to distinguish them from libc locales.
Here are some example collations that might be created:
de-x-icu
German collation, default variant
de-AT-x-icu
German collation for Austria, default variant (There are also, say, de-DE-x-icu or de-CH-x-icu, but as of this writing, they are equivalent to de-x-icu.)
und-x-icu (for “undefined”)
ICU “root” collation. Use this to get a reasonable language-agnostic sort order.
Some (less frequently used) encodings are not supported by ICU. When the database encoding is one of these, ICU collation entries in pg_collation are ignored. Attempting to use one will draw an error along the lines of “collation "de-x-icu" for encoding "WIN874" does not exist”.
If the standard and predefined collations are not sufficient, users can create their own collation objects using the SQL command CREATE COLLATION.
The standard and predefined collations are in the schema pg_catalog, like all predefined objects. User-defined collations should be created in user schemas. This also ensures that they are saved by pg_dump.
New libc collations can be created like this:
CREATE COLLATION german (provider = libc, locale = 'de_DE');
The exact values that are acceptable for the locale clause in this command depend on the operating system. On Unix-like systems, the command locale -a will show a list.
Since the predefined libc collations already include all collations defined in the operating system when the database instance is initialized, it is not often necessary to manually create new ones. Reasons might be if a different naming system is desired (in which case see also Section 24.2.2.3.3) or if the operating system has been upgraded to provide new locale definitions (in which case see also pg_import_system_collations()).
ICU allows collations to be customized beyond the basic language+country set that is preloaded by initdb. Users are encouraged to define their own collation objects that make use of these facilities to suit the sorting behavior to their requirements. See https://unicode-org.github.io/icu/userguide/locale/ and https://unicode-org.github.io/icu/userguide/collation/api.html for information on ICU locale naming. The set of acceptable names and attributes depends on the particular ICU version.
Here are some examples:
CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');
CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');
German collation with phone book collation type
The first example selects the ICU locale using a “language tag” per BCP 47. The second example uses the traditional ICU-specific locale syntax. The first style is preferred going forward, but it is not supported by older ICU versions.
Note that you can name the collation objects in the SQL environment anything you want. In this example, we follow the naming style that the predefined collations use, which in turn also follow BCP 47, but that is not required for user-defined collations.
CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');
CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');
Root collation with Emoji collation type, per Unicode Technical Standard #51
Observe how in the traditional ICU locale naming system, the root locale is selected by an empty string.
CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');
CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn');
Sort Greek letters before Latin ones. (The default is Latin before Greek.)
CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');
Sort upper-case letters before lower-case letters. (The default is lower-case letters first.)
CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');
CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn');
Combines both of the above options.
CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21 < A-123 (also known as natural sort).
See Unicode Technical Standard #35 and BCP 47 for details. The list of possible collation types (co subtag) can be found in the CLDR repository.
Note that while this system allows creating collations that “ignore case” or “ignore accents” or similar (using the ks key), in order for such collations to act in a truly case- or accent-insensitive manner, they also need to be declared as not deterministic in CREATE COLLATION; see Section 24.2.2.4. Otherwise, any strings that compare equal according to the collation but are not byte-wise equal will be sorted according to their byte values.
By design, ICU will accept almost any string as a locale name and match it to the closest locale it can provide, using the fallback procedure described in its documentation. Thus, there will be no direct feedback if a collation specification is composed using features that the given ICU installation does not actually support. It is therefore recommended to create application-level test cases to check that the collation definitions satisfy one's requirements.
The command CREATE COLLATION can also be used to create a new collation from an existing collation, which can be useful to be able to use operating-system-independent collation names in applications, create compatibility names, or use an ICU-provided collation under a more readable name. For example:
CREATE COLLATION german FROM "de_DE"; CREATE COLLATION french FROM "fr-x-icu";
A collation is either deterministic or nondeterministic. A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence. Nondeterministic comparison may determine strings to be equal even if they consist of different bytes. Typical situations include case-insensitive comparison, accent-insensitive comparison, as well as comparison of strings in different Unicode normal forms. It is up to the collation provider to actually implement such insensitive comparisons; the deterministic flag only determines whether ties are to be broken using bytewise comparison. See also Unicode Technical Standard 10 for more information on the terminology.
To create a nondeterministic collation, specify the property deterministic = false to CREATE COLLATION, for example:
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
This example would use the standard Unicode collation in a nondeterministic way. In particular, this would allow strings in different normal forms to be compared correctly. More interesting examples make use of the ICU customization facilities explained above. For example:
CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
All standard and predefined collations are deterministic, and all user-defined collations are deterministic by default. While nondeterministic collations give a more “correct” behavior, especially when considering the full power of Unicode and its many special cases, they also have some drawbacks. Foremost, their use leads to a performance penalty. Note, in particular, that B-tree indexes cannot use deduplication when they use a nondeterministic collation. Also, certain operations are not possible with nondeterministic collations, such as pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted.
To deal with text in different Unicode normalization forms, it is also an option to use the functions/expressions normalize and is normalized to preprocess or check the strings, instead of using nondeterministic collations. There are different trade-offs for each approach.
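For example, a sketch using Unicode escape strings (this assumes a UTF8 server encoding and a PostgreSQL version that provides these functions):
SELECT normalize(U&'\0061\0308', NFC);   -- returns the single precomposed character ä
SELECT U&'\0061\0308' IS NFC NORMALIZED; -- false: the input is in decomposed (NFD) form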
The character set support in PostgreSQL allows you to store text in a variety of character sets (also called encodings), including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (that is, as a server-side encoding).
The default character set is selected while initializing your PostgreSQL database cluster using initdb. It can be overridden when you create a database, so you can have multiple databases each with a different character set.
An important restriction, however, is that each database's character set must be compatible with the database's LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings. For C or POSIX locale, any character set is allowed, but for other libc-provided locales there is only one character set that will work correctly. (On Windows, however, UTF-8 encoding can be used with any locale.) If you have ICU support configured, ICU-provided locales can be used with most but not all server-side encodings.
Table 24.1 shows the character sets available for use in PostgreSQL.
Table 24.1. PostgreSQL Character Sets
Name | Description | Language | Server? | ICU? | Bytes/Char | Aliases |
---|---|---|---|---|---|---|
BIG5 | Big Five | Traditional Chinese | No | No | 1–2 | WIN950 , Windows950 |
EUC_CN | Extended UNIX Code-CN | Simplified Chinese | Yes | Yes | 1–3 | |
EUC_JP | Extended UNIX Code-JP | Japanese | Yes | Yes | 1–3 | |
EUC_JIS_2004 | Extended UNIX Code-JP, JIS X 0213 | Japanese | Yes | No | 1–3 | |
EUC_KR | Extended UNIX Code-KR | Korean | Yes | Yes | 1–3 | |
EUC_TW | Extended UNIX Code-TW | Traditional Chinese, Taiwanese | Yes | Yes | 1–3 | |
GB18030 | National Standard | Chinese | No | No | 1–4 | |
GBK | Extended National Standard | Simplified Chinese | No | No | 1–2 | WIN936 , Windows936 |
ISO_8859_5 | ISO 8859-5, ECMA 113 | Latin/Cyrillic | Yes | Yes | 1 | |
ISO_8859_6 | ISO 8859-6, ECMA 114 | Latin/Arabic | Yes | Yes | 1 | |
ISO_8859_7 | ISO 8859-7, ECMA 118 | Latin/Greek | Yes | Yes | 1 | |
ISO_8859_8 | ISO 8859-8, ECMA 121 | Latin/Hebrew | Yes | Yes | 1 | |
JOHAB | JOHAB | Korean (Hangul) | No | No | 1–3 | |
KOI8R | KOI8-R | Cyrillic (Russian) | Yes | Yes | 1 | KOI8 |
KOI8U | KOI8-U | Cyrillic (Ukrainian) | Yes | Yes | 1 | |
LATIN1 | ISO 8859-1, ECMA 94 | Western European | Yes | Yes | 1 | ISO88591 |
LATIN2 | ISO 8859-2, ECMA 94 | Central European | Yes | Yes | 1 | ISO88592 |
LATIN3 | ISO 8859-3, ECMA 94 | South European | Yes | Yes | 1 | ISO88593 |
LATIN4 | ISO 8859-4, ECMA 94 | North European | Yes | Yes | 1 | ISO88594 |
LATIN5 | ISO 8859-9, ECMA 128 | Turkish | Yes | Yes | 1 | ISO88599 |
LATIN6 | ISO 8859-10, ECMA 144 | Nordic | Yes | Yes | 1 | ISO885910 |
LATIN7 | ISO 8859-13 | Baltic | Yes | Yes | 1 | ISO885913 |
LATIN8 | ISO 8859-14 | Celtic | Yes | Yes | 1 | ISO885914 |
LATIN9 | ISO 8859-15 | LATIN1 with Euro and accents | Yes | Yes | 1 | ISO885915 |
LATIN10 | ISO 8859-16, ASRO SR 14111 | Romanian | Yes | No | 1 | ISO885916 |
MULE_INTERNAL | Mule internal code | Multilingual Emacs | Yes | No | 1–4 | |
SJIS | Shift JIS | Japanese | No | No | 1–2 | Mskanji , ShiftJIS , WIN932 , Windows932 |
SHIFT_JIS_2004 | Shift JIS, JIS X 0213 | Japanese | No | No | 1–2 | |
SQL_ASCII | unspecified (see text) | any | Yes | No | 1 | |
UHC | Unified Hangul Code | Korean | No | No | 1–2 | WIN949 , Windows949 |
UTF8 | Unicode, 8-bit | all | Yes | Yes | 1–4 | Unicode |
WIN866 | Windows CP866 | Cyrillic | Yes | Yes | 1 | ALT |
WIN874 | Windows CP874 | Thai | Yes | No | 1 | |
WIN1250 | Windows CP1250 | Central European | Yes | Yes | 1 | |
WIN1251 | Windows CP1251 | Cyrillic | Yes | Yes | 1 | WIN |
WIN1252 | Windows CP1252 | Western European | Yes | Yes | 1 | |
WIN1253 | Windows CP1253 | Greek | Yes | Yes | 1 | |
WIN1254 | Windows CP1254 | Turkish | Yes | Yes | 1 | |
WIN1255 | Windows CP1255 | Hebrew | Yes | Yes | 1 | |
WIN1256 | Windows CP1256 | Arabic | Yes | Yes | 1 | |
WIN1257 | Windows CP1257 | Baltic | Yes | Yes | 1 | |
WIN1258 | Windows CP1258 | Vietnamese | Yes | Yes | 1 | ABC , TCVN , TCVN5712 , VSCII |
Not all client APIs support all the listed character sets. For example, the PostgreSQL JDBC driver does not support MULE_INTERNAL, LATIN6, LATIN8, and LATIN10.
The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.
initdb defines the default character set (encoding) for a PostgreSQL cluster. For example,
initdb -E EUC_JP
sets the default character set to EUC_JP (Extended Unix Code for Japanese). You can use --encoding instead of -E if you prefer longer option strings. If no -E or --encoding option is given, initdb attempts to determine the appropriate encoding to use based on the specified or default locale.
You can specify a non-default encoding at database creation time, provided that the encoding is compatible with the selected locale:
createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
This will create a database named korean that uses the character set EUC_KR, and locale ko_KR.
Another way to accomplish this is to use this SQL command:
CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
Notice that the above commands specify copying the template0 database. When copying any other database, the encoding and locale settings cannot be changed from those of the source database, because that might result in corrupt data. For more information see Section 23.3.
The encoding for a database is stored in the system catalog pg_database. You can see it by using the psql -l option or the \l command.
$ psql -l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access Privileges
-----------+----------+-----------+-------------+-------------+-------------------------------------
clocaledb | hlinnaka | SQL_ASCII | C | C |
englishdb | hlinnaka | UTF8 | en_GB.UTF8 | en_GB.UTF8 |
japanese | hlinnaka | UTF8 | ja_JP.UTF8 | ja_JP.UTF8 |
korean | hlinnaka | EUC_KR | ko_KR.euckr | ko_KR.euckr |
postgres | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 |
template0 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
template1 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
(7 rows)
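The same information can also be retrieved with a query, for example (a sketch):
SELECT datname, pg_encoding_to_char(encoding) AS encoding FROM pg_database;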
On most modern operating systems, PostgreSQL
can determine which character set is implied by the LC_CTYPE
setting, and it will enforce that only the matching database encoding is
used. On older systems it is your responsibility to ensure that you use
the encoding expected by the locale you have selected. A mistake in
this area is likely to lead to strange behavior of locale-dependent
operations such as sorting.
PostgreSQL will allow superusers to create databases with SQL_ASCII encoding even when LC_CTYPE is not C or POSIX. As noted above, SQL_ASCII does not enforce that the data stored in the database has any particular encoding, and so this choice poses risks of locale-dependent misbehavior. Using this combination of settings is deprecated and may someday be forbidden altogether.
PostgreSQL supports automatic character set conversion between server and client for many combinations of character sets (Section 24.3.4 shows which ones).
To enable automatic character set conversion, you have to tell PostgreSQL the character set (encoding) you would like to use in the client. There are several ways to accomplish this:
Using the \encoding command in psql. \encoding allows you to change client encoding on the fly. For example, to change the encoding to SJIS, type:
\encoding SJIS
libpq (Section 34.11) has functions to control the client encoding.
Using SET client_encoding TO. Setting the client encoding can be done with this SQL command:
SET CLIENT_ENCODING TO 'value';
Also you can use the standard SQL syntax SET NAMES for this purpose:
SET NAMES 'value';
To query the current client encoding:
SHOW client_encoding;
To return to the default encoding:
RESET client_encoding;
Using PGCLIENTENCODING. If the environment variable PGCLIENTENCODING is defined in the client's environment, that client encoding is automatically selected when a connection to the server is made. (This can subsequently be overridden using any of the other methods mentioned above.) A shell sketch of this method appears after this list.
Using the configuration variable client_encoding. If the client_encoding variable is set, that client encoding is automatically selected when a connection to the server is made. (This can subsequently be overridden using any of the other methods mentioned above.)
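As a shell sketch of the PGCLIENTENCODING method described above (assuming a Bourne-compatible shell and the example database mydb):
export PGCLIENTENCODING=SJIS
psql mydb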
If the conversion of a particular character is not possible
— suppose you chose EUC_JP
for the
server and LATIN1
for the client, and some
Japanese characters are returned that do not have a representation in
LATIN1
— an error is reported.
If the client character set is defined as SQL_ASCII, encoding conversion is disabled, regardless of the server's character set. (However, if the server's character set is not SQL_ASCII, the server will still check that incoming data is valid for that encoding; so the net effect is as though the client character set were the same as the server's.) Just as for the server, use of SQL_ASCII is unwise unless you are working with all-ASCII data.
PostgreSQL allows conversion between any two character sets for which a conversion function is listed in the pg_conversion system catalog. PostgreSQL comes with some predefined conversions, as summarized in Table 24.2 and shown in more detail in Table 24.3. You can create a new conversion using the SQL command CREATE CONVERSION. (To be used for automatic client/server conversions, a conversion must be marked as “default” for its character set pair.)
Table 24.2. Built-in Client/Server Character Set Conversions
Server Character Set | Available Client Character Sets |
---|---|
BIG5 | not supported as a server encoding |
EUC_CN | EUC_CN, MULE_INTERNAL, UTF8 |
EUC_JP | EUC_JP, MULE_INTERNAL, SJIS, UTF8 |
EUC_JIS_2004 | EUC_JIS_2004, SHIFT_JIS_2004, UTF8 |
EUC_KR | EUC_KR, MULE_INTERNAL, UTF8 |
EUC_TW | EUC_TW, BIG5, MULE_INTERNAL, UTF8 |
GB18030 | not supported as a server encoding |
GBK | not supported as a server encoding |
ISO_8859_5 | ISO_8859_5, KOI8R, MULE_INTERNAL, UTF8, WIN866, WIN1251 |
ISO_8859_6 | ISO_8859_6, UTF8 |
ISO_8859_7 | ISO_8859_7, UTF8 |
ISO_8859_8 | ISO_8859_8, UTF8 |
JOHAB | not supported as a server encoding |
KOI8R | KOI8R, ISO_8859_5, MULE_INTERNAL, UTF8, WIN866, WIN1251 |
KOI8U | KOI8U, UTF8 |
LATIN1 | LATIN1, MULE_INTERNAL, UTF8 |
LATIN2 | LATIN2, MULE_INTERNAL, UTF8, WIN1250 |
LATIN3 | LATIN3, MULE_INTERNAL, UTF8 |
LATIN4 | LATIN4, MULE_INTERNAL, UTF8 |
LATIN5 | LATIN5, UTF8 |
LATIN6 | LATIN6, UTF8 |
LATIN7 | LATIN7, UTF8 |
LATIN8 | LATIN8, UTF8 |
LATIN9 | LATIN9, UTF8 |
LATIN10 | LATIN10, UTF8 |
MULE_INTERNAL | MULE_INTERNAL, BIG5, EUC_CN, EUC_JP, EUC_KR, EUC_TW, ISO_8859_5, KOI8R, LATIN1 to LATIN4, SJIS, WIN866, WIN1250, WIN1251 |
SJIS | not supported as a server encoding |
SHIFT_JIS_2004 | not supported as a server encoding |
SQL_ASCII | any (no conversion will be performed) |
UHC | not supported as a server encoding |
UTF8 | all supported encodings |
WIN866 | WIN866, ISO_8859_5, KOI8R, MULE_INTERNAL, UTF8, WIN1251 |
WIN874 | WIN874, UTF8 |
WIN1250 | WIN1250, LATIN2, MULE_INTERNAL, UTF8 |
WIN1251 | WIN1251, ISO_8859_5, KOI8R, MULE_INTERNAL, UTF8, WIN866 |
WIN1252 | WIN1252, UTF8 |
WIN1253 | WIN1253, UTF8 |
WIN1254 | WIN1254, UTF8 |
WIN1255 | WIN1255, UTF8 |
WIN1256 | WIN1256, UTF8 |
WIN1257 | WIN1257, UTF8 |
WIN1258 | WIN1258, UTF8 |
Table 24.3. All Built-in Character Set Conversions
Conversion Name [a] | Source Encoding | Destination Encoding |
---|---|---|
big5_to_euc_tw | BIG5 | EUC_TW |
big5_to_mic | BIG5 | MULE_INTERNAL |
big5_to_utf8 | BIG5 | UTF8 |
euc_cn_to_mic | EUC_CN | MULE_INTERNAL |
euc_cn_to_utf8 | EUC_CN | UTF8 |
euc_jp_to_mic | EUC_JP | MULE_INTERNAL |
euc_jp_to_sjis | EUC_JP | SJIS |
euc_jp_to_utf8 | EUC_JP | UTF8 |
euc_kr_to_mic | EUC_KR | MULE_INTERNAL |
euc_kr_to_utf8 | EUC_KR | UTF8 |
euc_tw_to_big5 | EUC_TW | BIG5 |
euc_tw_to_mic | EUC_TW | MULE_INTERNAL |
euc_tw_to_utf8 | EUC_TW | UTF8 |
gb18030_to_utf8 | GB18030 | UTF8 |
gbk_to_utf8 | GBK | UTF8 |
iso_8859_10_to_utf8 | LATIN6 | UTF8 |
iso_8859_13_to_utf8 | LATIN7 | UTF8 |
iso_8859_14_to_utf8 | LATIN8 | UTF8 |
iso_8859_15_to_utf8 | LATIN9 | UTF8 |
iso_8859_16_to_utf8 | LATIN10 | UTF8 |
iso_8859_1_to_mic | LATIN1 | MULE_INTERNAL |
iso_8859_1_to_utf8 | LATIN1 | UTF8 |
iso_8859_2_to_mic | LATIN2 | MULE_INTERNAL |
iso_8859_2_to_utf8 | LATIN2 | UTF8 |
iso_8859_2_to_windows_1250 | LATIN2 | WIN1250 |
iso_8859_3_to_mic | LATIN3 | MULE_INTERNAL |
iso_8859_3_to_utf8 | LATIN3 | UTF8 |
iso_8859_4_to_mic | LATIN4 | MULE_INTERNAL |
iso_8859_4_to_utf8 | LATIN4 | UTF8 |
iso_8859_5_to_koi8_r | ISO_8859_5 | KOI8R |
iso_8859_5_to_mic | ISO_8859_5 | MULE_INTERNAL |
iso_8859_5_to_utf8 | ISO_8859_5 | UTF8 |
iso_8859_5_to_windows_1251 | ISO_8859_5 | WIN1251 |
iso_8859_5_to_windows_866 | ISO_8859_5 | WIN866 |
iso_8859_6_to_utf8 | ISO_8859_6 | UTF8 |
iso_8859_7_to_utf8 | ISO_8859_7 | UTF8 |
iso_8859_8_to_utf8 | ISO_8859_8 | UTF8 |
iso_8859_9_to_utf8 | LATIN5 | UTF8 |
johab_to_utf8 | JOHAB | UTF8 |
koi8_r_to_iso_8859_5 | KOI8R | ISO_8859_5 |
koi8_r_to_mic | KOI8R | MULE_INTERNAL |
koi8_r_to_utf8 | KOI8R | UTF8 |
koi8_r_to_windows_1251 | KOI8R | WIN1251 |
koi8_r_to_windows_866 | KOI8R | WIN866 |
koi8_u_to_utf8 | KOI8U | UTF8 |
mic_to_big5 | MULE_INTERNAL | BIG5 |
mic_to_euc_cn | MULE_INTERNAL | EUC_CN |
mic_to_euc_jp | MULE_INTERNAL | EUC_JP |
mic_to_euc_kr | MULE_INTERNAL | EUC_KR |
mic_to_euc_tw | MULE_INTERNAL | EUC_TW |
mic_to_iso_8859_1 | MULE_INTERNAL | LATIN1 |
mic_to_iso_8859_2 | MULE_INTERNAL | LATIN2 |
mic_to_iso_8859_3 | MULE_INTERNAL | LATIN3 |
mic_to_iso_8859_4 | MULE_INTERNAL | LATIN4 |
mic_to_iso_8859_5 | MULE_INTERNAL | ISO_8859_5 |
mic_to_koi8_r | MULE_INTERNAL | KOI8R |
mic_to_sjis | MULE_INTERNAL | SJIS |
mic_to_windows_1250 | MULE_INTERNAL | WIN1250 |
mic_to_windows_1251 | MULE_INTERNAL | WIN1251 |
mic_to_windows_866 | MULE_INTERNAL | WIN866 |
sjis_to_euc_jp | SJIS | EUC_JP |
sjis_to_mic | SJIS | MULE_INTERNAL |
sjis_to_utf8 | SJIS | UTF8 |
windows_1258_to_utf8 | WIN1258 | UTF8 |
uhc_to_utf8 | UHC | UTF8 |
utf8_to_big5 | UTF8 | BIG5 |
utf8_to_euc_cn | UTF8 | EUC_CN |
utf8_to_euc_jp | UTF8 | EUC_JP |
utf8_to_euc_kr | UTF8 | EUC_KR |
utf8_to_euc_tw | UTF8 | EUC_TW |
utf8_to_gb18030 | UTF8 | GB18030 |
utf8_to_gbk | UTF8 | GBK |
utf8_to_iso_8859_1 | UTF8 | LATIN1 |
utf8_to_iso_8859_10 | UTF8 | LATIN6 |
utf8_to_iso_8859_13 | UTF8 | LATIN7 |
utf8_to_iso_8859_14 | UTF8 | LATIN8 |
utf8_to_iso_8859_15 | UTF8 | LATIN9 |
utf8_to_iso_8859_16 | UTF8 | LATIN10 |
utf8_to_iso_8859_2 | UTF8 | LATIN2 |
utf8_to_iso_8859_3 | UTF8 | LATIN3 |
utf8_to_iso_8859_4 | UTF8 | LATIN4 |
utf8_to_iso_8859_5 | UTF8 | ISO_8859_5 |
utf8_to_iso_8859_6 | UTF8 | ISO_8859_6 |
utf8_to_iso_8859_7 | UTF8 | ISO_8859_7 |
utf8_to_iso_8859_8 | UTF8 | ISO_8859_8 |
utf8_to_iso_8859_9 | UTF8 | LATIN5 |
utf8_to_johab | UTF8 | JOHAB |
utf8_to_koi8_r | UTF8 | KOI8R |
utf8_to_koi8_u | UTF8 | KOI8U |
utf8_to_sjis | UTF8 | SJIS |
utf8_to_windows_1258 | UTF8 | WIN1258 |
utf8_to_uhc | UTF8 | UHC |
utf8_to_windows_1250 | UTF8 | WIN1250 |
utf8_to_windows_1251 | UTF8 | WIN1251 |
utf8_to_windows_1252 | UTF8 | WIN1252 |
utf8_to_windows_1253 | UTF8 | WIN1253 |
utf8_to_windows_1254 | UTF8 | WIN1254 |
utf8_to_windows_1255 | UTF8 | WIN1255 |
utf8_to_windows_1256 | UTF8 | WIN1256 |
utf8_to_windows_1257 | UTF8 | WIN1257 |
utf8_to_windows_866 | UTF8 | WIN866 |
utf8_to_windows_874 | UTF8 | WIN874 |
windows_1250_to_iso_8859_2 | WIN1250 | LATIN2 |
windows_1250_to_mic | WIN1250 | MULE_INTERNAL |
windows_1250_to_utf8 | WIN1250 | UTF8 |
windows_1251_to_iso_8859_5 | WIN1251 | ISO_8859_5 |
windows_1251_to_koi8_r | WIN1251 | KOI8R |
windows_1251_to_mic | WIN1251 | MULE_INTERNAL |
windows_1251_to_utf8 | WIN1251 | UTF8 |
windows_1251_to_windows_866 | WIN1251 | WIN866 |
windows_1252_to_utf8 | WIN1252 | UTF8 |
windows_1256_to_utf8 | WIN1256 | UTF8 |
windows_866_to_iso_8859_5 | WIN866 | ISO_8859_5 |
windows_866_to_koi8_r | WIN866 | KOI8R |
windows_866_to_mic | WIN866 | MULE_INTERNAL |
windows_866_to_utf8 | WIN866 | UTF8 |
windows_866_to_windows_1251 | WIN866 | WIN1251 |
windows_874_to_utf8 | WIN874 | UTF8 |
euc_jis_2004_to_utf8 | EUC_JIS_2004 | UTF8 |
utf8_to_euc_jis_2004 | UTF8 | EUC_JIS_2004 |
shift_jis_2004_to_utf8 | SHIFT_JIS_2004 | UTF8 |
utf8_to_shift_jis_2004 | UTF8 | SHIFT_JIS_2004 |
euc_jis_2004_to_shift_jis_2004 | EUC_JIS_2004 | SHIFT_JIS_2004 |
shift_jis_2004_to_euc_jis_2004 | SHIFT_JIS_2004 | EUC_JIS_2004 |
[a] The conversion names follow a standard naming scheme: the official name of the source encoding with all non-alphanumeric characters replaced by underscores, followed by _to_, followed by the similarly processed destination encoding name.
These are good sources to start learning about various kinds of encoding systems.
Contains detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW.
The web site of the Unicode Consortium.
UTF-8 (8-bit UCS/Unicode Transformation Format) is defined here.
PostgreSQL, like any database software, requires that certain tasks be performed regularly to achieve optimum performance. The tasks discussed here are required, but they are repetitive in nature and can easily be automated using standard tools such as cron scripts or Windows' Task Scheduler. It is the database administrator's responsibility to set up appropriate scripts, and to check that they execute successfully.
One obvious maintenance task is the creation of backup copies of the data on a regular schedule. Without a recent backup, you have no chance of recovery after a catastrophe (disk failure, fire, mistakenly dropping a critical table, etc.). The backup and recovery mechanisms available in PostgreSQL are discussed at length in Chapter 26.
The other main category of maintenance task is periodic “vacuuming” of the database. This activity is discussed in Section 25.1. Closely related to this is updating the statistics that will be used by the query planner, as discussed in Section 25.1.3.
Another task that might need periodic attention is log file management. This is discussed in Section 25.3.
check_postgres is available for monitoring database health and reporting unusual conditions. check_postgres integrates with Nagios and MRTG, but can be run standalone too.
PostgreSQL is low-maintenance compared to some other database management systems. Nonetheless, appropriate attention to these tasks will go far towards ensuring a pleasant and productive experience with the system.
PostgreSQL databases require periodic maintenance known as vacuuming. For many installations, it is sufficient to let vacuuming be performed by the autovacuum daemon, which is described in Section 25.1.6. You might need to adjust the autovacuuming parameters described there to obtain best results for your situation. Some database administrators will want to supplement or replace the daemon's activities with manually-managed VACUUM commands, which typically are executed according to a schedule by cron or Task Scheduler scripts. To set up manually-managed vacuuming properly, it is essential to understand the issues discussed in the next few subsections. Administrators who rely on autovacuuming may still wish to skim this material to help them understand and adjust autovacuuming.
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound.
Each of these reasons dictates performing VACUUM operations of varying frequency and scope, as explained in the following subsections.
There are two variants of VACUUM: standard VACUUM and VACUUM FULL. VACUUM FULL can reclaim more disk space but runs much more slowly. Also, the standard form of VACUUM can run in parallel with production database operations. (Commands such as SELECT, INSERT, UPDATE, and DELETE will continue to function normally, though you will not be able to modify the definition of a table with commands such as ALTER TABLE while it is being vacuumed.) VACUUM FULL requires an ACCESS EXCLUSIVE lock on the table it is working on, and therefore cannot be done in parallel with other use of the table. Generally, therefore, administrators should strive to use standard VACUUM and avoid VACUUM FULL.
VACUUM creates a substantial amount of I/O traffic, which can cause poor performance for other active sessions. There are configuration parameters that can be adjusted to reduce the performance impact of background vacuuming — see Section 20.4.4.
In PostgreSQL, an UPDATE or DELETE of a row does not immediately remove the old version of the row. This approach is necessary to gain the benefits of multiversion concurrency control (MVCC, see Chapter 13): the row version must not be deleted while it is still potentially visible to other transactions. But eventually, an outdated or deleted row version is no longer of interest to any transaction. The space it occupies must then be reclaimed for reuse by new rows, to avoid unbounded growth of disk space requirements. This is done by running VACUUM.
The standard form of VACUUM removes dead row versions in tables and indexes and marks the space available for future reuse. However, it will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained. In contrast, VACUUM FULL actively compacts tables by writing a complete new version of the table file with no dead space. This minimizes the size of the table, but can take a long time. It also requires extra disk space for the new copy of the table, until the operation completes.
The usual goal of routine vacuuming is to do standard VACUUMs often enough to avoid needing VACUUM FULL. The autovacuum daemon attempts to work this way, and in fact will never issue VACUUM FULL. In this approach, the idea is not to keep tables at their minimum size, but to maintain steady-state usage of disk space: each table occupies space equivalent to its minimum size plus however much space gets used up between vacuum runs. Although VACUUM FULL can be used to shrink a table back to its minimum size and return the disk space to the operating system, there is not much point in this if the table will just grow again in the future. Thus, moderately-frequent standard VACUUM runs are a better approach than infrequent VACUUM FULL runs for maintaining heavily-updated tables.
Some administrators prefer to schedule vacuuming themselves, for example doing all the work at night when load is low. The difficulty with doing vacuuming according to a fixed schedule is that if a table has an unexpected spike in update activity, it may get bloated to the point that VACUUM FULL is really necessary to reclaim space. Using the autovacuum daemon alleviates this problem, since the daemon schedules vacuuming dynamically in response to update activity. It is unwise to disable the daemon completely unless you have an extremely predictable workload. One possible compromise is to set the daemon's parameters so that it will only react to unusually heavy update activity, thus keeping things from getting out of hand, while scheduled VACUUMs are expected to do the bulk of the work when the load is typical.
For those not using autovacuum, a typical approach is to schedule a database-wide VACUUM once a day during a low-usage period, supplemented by more frequent vacuuming of heavily-updated tables as necessary. (Some installations with extremely high update rates vacuum their busiest tables as often as once every few minutes.) If you have multiple databases in a cluster, don't forget to VACUUM each one; the program vacuumdb might be helpful.
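For example, a nightly database-wide run could be driven from a cron entry such as this sketch (assuming the job runs as an operating system user that is allowed to connect to all databases):
0 3 * * * vacuumdb --all --analyze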
Plain VACUUM may not be satisfactory when a table contains large numbers of dead row versions as a result of massive update or delete activity. If you have such a table and you need to reclaim the excess disk space it occupies, you will need to use VACUUM FULL, or alternatively CLUSTER or one of the table-rewriting variants of ALTER TABLE. These commands rewrite an entire new copy of the table and build new indexes for it. All these options require an ACCESS EXCLUSIVE lock. Note that they also temporarily use extra disk space approximately equal to the size of the table, since the old copies of the table and indexes can't be released until the new ones are complete.
If you have a table whose entire contents are deleted on a periodic basis, consider doing it with TRUNCATE rather than using DELETE followed by VACUUM. TRUNCATE removes the entire content of the table immediately, without requiring a subsequent VACUUM or VACUUM FULL to reclaim the now-unused disk space. The disadvantage is that strict MVCC semantics are violated.
The PostgreSQL query planner relies on statistical information about the contents of tables in order to generate good plans for queries. These statistics are gathered by the ANALYZE command, which can be invoked by itself or as an optional step in VACUUM. It is important to have reasonably accurate statistics, otherwise poor choices of plans might degrade database performance.
The autovacuum daemon, if enabled, will automatically issue ANALYZE commands whenever the content of a table has changed sufficiently. However, administrators might prefer to rely on manually-scheduled ANALYZE operations, particularly if it is known that update activity on a table will not affect the statistics of “interesting” columns. The daemon schedules ANALYZE strictly as a function of the number of rows inserted or updated; it has no knowledge of whether that will lead to meaningful statistical changes.
Tuples changed in partitions and inheritance children do not trigger analyze on the parent table. If the parent table is empty or rarely changed, it may never be processed by autovacuum, and the statistics for the inheritance tree as a whole won't be collected. It is necessary to run ANALYZE on the parent table manually in order to keep the statistics up to date.
As with vacuuming for space recovery, frequent updates of statistics are more useful for heavily-updated tables than for seldom-updated ones. But even for a heavily-updated table, there might be no need for statistics updates if the statistical distribution of the data is not changing much. A simple rule of thumb is to think about how much the minimum and maximum values of the columns in the table change. For example, a timestamp column that contains the time of row update will have a constantly-increasing maximum value as rows are added and updated; such a column will probably need more frequent statistics updates than, say, a column containing URLs for pages accessed on a website. The URL column might receive changes just as often, but the statistical distribution of its values probably changes relatively slowly.
It is possible to run ANALYZE
on specific tables and even
just specific columns of a table, so the flexibility exists to update some
statistics more frequently than others if your application requires it.
In practice, however, it is usually best to just analyze the entire
database, because it is a fast operation. ANALYZE
uses a
statistically random sampling of the rows of a table rather than reading
every single row.
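For example, ANALYZE can be limited to a single table, or to particular columns of it (the table and column names here are hypothetical):
ANALYZE mytable;
ANALYZE mytable (last_update, url);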
Although per-column tweaking of ANALYZE
frequency might not be
very productive, you might find it worthwhile to do per-column
adjustment of the level of detail of the statistics collected by
ANALYZE
. Columns that are heavily used in WHERE
clauses and have highly irregular data distributions might require a
finer-grain data histogram than other columns. See ALTER TABLE
SET STATISTICS
, or change the database-wide default using the default_statistics_target configuration parameter.
Also, by default there is limited information available about the selectivity of functions. However, if you create a statistics object or an expression index that uses a function call, useful statistics will be gathered about the function, which can greatly improve query plans that use the expression index.
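As an illustration (object names are hypothetical), the statistics target of one column can be raised, and an expression index causes statistics to be gathered for its expression:
ALTER TABLE mytable ALTER COLUMN url SET STATISTICS 500;
CREATE INDEX mytable_lower_url_idx ON mytable (lower(url));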
The autovacuum daemon does not issue ANALYZE
commands for
foreign tables, since it has no means of determining how often that
might be useful. If your queries require statistics on foreign tables
for proper planning, it's a good idea to run manually-managed
ANALYZE
commands on those tables on a suitable schedule.
The autovacuum daemon does not issue ANALYZE
commands
for partitioned tables. Inheritance parents will only be analyzed if the
parent itself is changed - changes to child tables do not trigger
autoanalyze on the parent table. If your queries require statistics on
parent tables for proper planning, it is necessary to periodically run
a manual ANALYZE
on those tables to keep the statistics
up to date.
Vacuum maintains a visibility map for each table to keep track of which pages contain only tuples that are known to be visible to all active transactions (and all future transactions, until the page is again modified). This has two purposes. First, vacuum itself can skip such pages on the next run, since there is nothing to clean up.
Second, it allows PostgreSQL to answer some queries using only the index, without reference to the underlying table. Since PostgreSQL indexes don't contain tuple visibility information, a normal index scan fetches the heap tuple for each matching index entry, to check whether it should be seen by the current transaction. An index-only scan, on the other hand, checks the visibility map first. If it's known that all tuples on the page are visible, the heap fetch can be skipped. This is most useful on large data sets where the visibility map can prevent disk accesses. The visibility map is vastly smaller than the heap, so it can easily be cached even when the heap is very large.
PostgreSQL's MVCC transaction semantics depend on being able to compare transaction ID (XID) numbers: a row version with an insertion XID greater than the current transaction's XID is “in the future” and should not be visible to the current transaction. But since transaction IDs have limited size (32 bits), a cluster that runs for a long time (more than 4 billion transactions) would suffer transaction ID wraparound: the XID counter wraps around to zero, and all of a sudden transactions that were in the past appear to be in the future — which means their output becomes invisible. In short, catastrophic data loss. (Actually the data is still there, but that's cold comfort if you cannot get at it.) To avoid this, it is necessary to vacuum every table in every database at least once every two billion transactions.
The reason that periodic vacuuming solves the problem is that
VACUUM
will mark rows as frozen, indicating that
they were inserted by a transaction that committed sufficiently far in
the past that the effects of the inserting transaction are certain to be
visible to all current and future transactions.
Normal XIDs are
compared using modulo-2^32 arithmetic. This means
that for every normal XID, there are two billion XIDs that are
“older” and two billion that are “newer”; another
way to say it is that the normal XID space is circular with no
endpoint. Therefore, once a row version has been created with a particular
normal XID, the row version will appear to be “in the past” for
the next two billion transactions, no matter which normal XID we are
talking about. If the row version still exists after more than two billion
transactions, it will suddenly appear to be in the future. To
prevent this, PostgreSQL reserves a special XID,
FrozenTransactionId
, which does not follow the normal XID
comparison rules and is always considered older
than every normal XID.
Frozen row versions are treated as if the inserting XID were
FrozenTransactionId
, so that they will appear to be
“in the past” to all normal transactions regardless of wraparound
issues, and so such row versions will be valid until deleted, no matter
how long that is.
In PostgreSQL versions before 9.4, freezing was
implemented by actually replacing a row's insertion XID
with FrozenTransactionId
, which was visible in the
row's xmin
system column. Newer versions just set a flag
bit, preserving the row's original xmin
for possible
forensic use. However, rows with xmin
equal
to FrozenTransactionId
(2) may still be found
in databases pg_upgrade'd from pre-9.4 versions.
Also, system catalogs may contain rows with xmin
equal
to BootstrapTransactionId
(1), indicating that they were
inserted during the first phase of initdb.
Like FrozenTransactionId
, this special XID is treated as
older than every normal XID.
vacuum_freeze_min_age controls how old an XID value has to be before rows bearing that XID will be frozen. Increasing this setting may avoid unnecessary work if the rows that would otherwise be frozen will soon be modified again, but decreasing this setting increases the number of transactions that can elapse before the table must be vacuumed again.
VACUUM
uses the visibility map
to determine which pages of a table must be scanned. Normally, it
will skip pages that don't have any dead row versions even if those pages
might still have row versions with old XID values. Therefore, normal
VACUUMs won't always freeze every old row version in the table.
Periodically, VACUUM
will perform an aggressive
vacuum, skipping only those pages which contain neither dead rows nor
any unfrozen XID or MXID values.
vacuum_freeze_table_age
controls when VACUUM
does that: all-visible but not all-frozen
pages are scanned if the number of transactions that have passed since the
last such scan is greater than vacuum_freeze_table_age
minus
vacuum_freeze_min_age
. Setting
vacuum_freeze_table_age
to 0 forces VACUUM
to
use this more aggressive strategy for all scans.
The maximum time that a table can go unvacuumed is two billion
transactions minus the vacuum_freeze_min_age
value at
the time of the last aggressive vacuum. If it were to go
unvacuumed for longer than
that, data loss could result. To ensure that this does not happen,
autovacuum is invoked on any table that might contain unfrozen rows with
XIDs older than the age specified by the configuration parameter autovacuum_freeze_max_age. (This will happen even if
autovacuum is disabled.)
This implies that if a table is not otherwise vacuumed,
autovacuum will be invoked on it approximately once every
autovacuum_freeze_max_age
minus
vacuum_freeze_min_age
transactions.
For tables that are regularly vacuumed for space reclamation purposes,
this is of little importance. However, for static tables
(including tables that receive inserts, but no updates or deletes),
there is no need to vacuum for space reclamation, so it can
be useful to try to maximize the interval between forced autovacuums
on very large static tables. Obviously one can do this either by
increasing autovacuum_freeze_max_age
or decreasing
vacuum_freeze_min_age
.
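For example, one might set the following in postgresql.conf; the values are purely illustrative, and autovacuum_freeze_max_age can only be changed at server start:
autovacuum_freeze_max_age = 1000000000   # allow up to 1 billion transactions between forced autovacuums
vacuum_freeze_min_age = 10000000         # freeze rows once their XIDs are 10 million transactions old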
The effective maximum for vacuum_freeze_table_age
is 0.95 *
autovacuum_freeze_max_age
; a setting higher than that will be
capped to the maximum. A value higher than
autovacuum_freeze_max_age
wouldn't make sense because an
anti-wraparound autovacuum would be triggered at that point anyway, and
the 0.95 multiplier leaves some breathing room to run a manual
VACUUM
before that happens. As a rule of thumb,
vacuum_freeze_table_age
should be set to a value somewhat
below autovacuum_freeze_max_age
, leaving enough gap so that
a regularly scheduled VACUUM
or an autovacuum triggered by
normal delete and update activity is run in that window. Setting it too
close could lead to anti-wraparound autovacuums, even though the table
was recently vacuumed to reclaim space, whereas lower values lead to more
frequent aggressive vacuuming.
The sole disadvantage of increasing autovacuum_freeze_max_age
(and vacuum_freeze_table_age
along with it) is that
the pg_xact
and pg_commit_ts
subdirectories of the database cluster will take more space, because it
must store the commit status and (if track_commit_timestamp
is
enabled) timestamp of all transactions back to
the autovacuum_freeze_max_age
horizon. The commit status uses
two bits per transaction, so if
autovacuum_freeze_max_age
is set to its maximum allowed value
of two billion, pg_xact
can be expected to grow to about half
a gigabyte and pg_commit_ts
to about 20GB. If this
is trivial compared to your total database size,
setting autovacuum_freeze_max_age
to its maximum allowed value
is recommended. Otherwise, set it depending on what you are willing to
allow for pg_xact
and pg_commit_ts
storage.
(The default, 200 million transactions, translates to about 50MB
of pg_xact
storage and about 2GB of pg_commit_ts
storage.)
One disadvantage of decreasing vacuum_freeze_min_age
is that
it might cause VACUUM
to do useless work: freezing a row
version is a waste of time if the row is modified
soon thereafter (causing it to acquire a new XID). So the setting should
be large enough that rows are not frozen until they are unlikely to change
any more.
To track the age of the oldest unfrozen XIDs in a database,
VACUUM
stores XID
statistics in the system tables pg_class
and
pg_database
. In particular,
the relfrozenxid
column of a table's
pg_class
row contains the freeze cutoff XID that was used
by the last aggressive VACUUM
for that table. All rows
inserted by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the datfrozenxid
column of a database's
pg_database
row is a lower bound on the unfrozen XIDs
appearing in that database — it is just the minimum of the
per-table relfrozenxid
values within the database.
A convenient way to
examine this information is to execute queries such as:
SELECT c.oid::regclass as table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm');

SELECT datname, age(datfrozenxid) FROM pg_database;
The age
column measures the number of transactions from the
cutoff XID to the current transaction's XID.
VACUUM
normally only scans pages that have been modified
since the last vacuum, but relfrozenxid
can only be
advanced when every page of the table
that might contain unfrozen XIDs is scanned. This happens when
relfrozenxid
is more than
vacuum_freeze_table_age
transactions old, when
VACUUM
's FREEZE
option is used, or when all
pages that are not already all-frozen happen to
require vacuuming to remove dead row versions. When VACUUM
scans every page in the table that is not already all-frozen, it should
set age(relfrozenxid)
to a value just a little more than the
vacuum_freeze_min_age
setting
that was used (more by the number of transactions started since the
VACUUM
started). If no relfrozenxid
-advancing
VACUUM
is issued on the table until
autovacuum_freeze_max_age
is reached, an autovacuum will soon
be forced for the table.
If for some reason autovacuum fails to clear old XIDs from a table, the system will begin to emit warning messages like this when the database's oldest XIDs reach forty million transactions from the wraparound point:
WARNING: database "mydb" must be vacuumed within 39985967 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
(A manual VACUUM
should fix the problem, as suggested by the
hint; but note that the VACUUM
should be performed by a
superuser, else it will fail to process system catalogs, which prevents it from
advancing the database's datfrozenxid
.)
If these warnings are ignored, the system will refuse to assign new XIDs once
there are fewer than three million transactions left until wraparound:
ERROR: database is not accepting commands to avoid wraparound data loss in database "mydb"
HINT: Stop the postmaster and vacuum that database in single-user mode.
In this condition any transactions already in progress can continue,
but only read-only transactions can be started. Operations that
modify database records or truncate relations will fail.
The VACUUM
command can still be run normally.
Contrary to what the hint states, it is not necessary or desirable to stop the
postmaster or enter single-user mode in order to restore normal operation.
Instead, follow these steps:
1. Resolve any old prepared transactions. These can be found by checking pg_prepared_xacts for rows where age(transactionid) is large. Such transactions should be committed or rolled back.
2. End long-running open transactions. These can be found by checking pg_stat_activity for rows where age(backend_xid) or age(backend_xmin) is large. Such transactions should be committed or rolled back, or the session can be terminated using pg_terminate_backend.
3. Drop any old replication slots. These can be found by checking pg_replication_slots for rows where age(xmin) or age(catalog_xmin) is large. In many cases, such slots were created for replication to servers that no longer exist, or that have been down for a long time. If you drop a slot for a server that still exists and might still try to connect to that slot, that replica may need to be rebuilt.
4. Run VACUUM in the target database. A database-wide VACUUM is simplest; to reduce the time required, it is also possible to issue manual VACUUM commands on the tables where relfrozenxid is oldest (see the example query after this list). Do not use VACUUM FULL in this scenario, because it requires an XID and will therefore fail, except in single-user mode, where it will instead consume an XID and thus increase the risk of transaction ID wraparound. Do not use VACUUM FREEZE either, because it will do more than the minimum amount of work required to restore normal operation.
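The following query (a sketch only, not part of the procedure above) lists the tables with the oldest XIDs, which are the best candidates for targeted vacuuming:
SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;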
In earlier versions, it was sometimes necessary to stop the postmaster and
VACUUM
the database in single-user mode. In typical scenarios, this
is no longer necessary, and should be avoided whenever possible, since it involves taking
the system down. It is also riskier, since it disables transaction ID wraparound safeguards
that are designed to prevent data loss. The only reason to use single-user mode in this
scenario is if you wish to TRUNCATE
or DROP
unneeded
tables to avoid needing to VACUUM
them. The three-million-transaction
safety margin exists to let the administrator do this. See the
postgres reference page for details about using single-user mode.
Multixact IDs are used to support row locking by
multiple transactions. Since there is only limited space in a tuple
header to store lock information, that information is encoded as
a “multiple transaction ID”, or multixact ID for short,
whenever there is more than one transaction concurrently locking a
row. Information about which transaction IDs are included in any
particular multixact ID is stored separately in
the pg_multixact
subdirectory, and only the multixact ID
appears in the xmax
field in the tuple header.
Like transaction IDs, multixact IDs are implemented as a
32-bit counter and corresponding storage, all of which requires
careful aging management, storage cleanup, and wraparound handling.
There is a separate storage area which holds the list of members in
each multixact, which also uses a 32-bit counter and which must also
be managed.
Whenever VACUUM
scans any part of a table, it will replace
any multixact ID it encounters which is older than
vacuum_multixact_freeze_min_age
by a different value, which can be the zero value, a single
transaction ID, or a newer multixact ID. For each table,
pg_class
.relminmxid
stores the oldest
possible multixact ID still appearing in any tuple of that table.
If this value is older than
vacuum_multixact_freeze_table_age, an aggressive
vacuum is forced. As discussed in the previous section, an aggressive
vacuum means that only those pages which are known to be all-frozen will
be skipped. mxid_age()
can be used on
pg_class
.relminmxid
to find its age.
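For example, a query such as the following (a sketch only) lists the tables with the oldest multixact IDs:
SELECT oid::regclass AS table_name, mxid_age(relminmxid) AS mxid_age
FROM pg_class
WHERE relkind IN ('r', 'm')
ORDER BY 2 DESC
LIMIT 20;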
Aggressive VACUUM
scans, regardless of
what causes them, enable advancing the value for that table.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
As a safety device, an aggressive vacuum scan will occur for any table whose multixact-age is greater than autovacuum_multixact_freeze_max_age. Also, if the storage occupied by multixact members exceeds 2GB, aggressive vacuum scans will occur more often for all tables, starting with those that have the oldest multixact-age. Both of these kinds of aggressive scans will occur even if autovacuum is nominally disabled.
Similar to the XID case, if autovacuum fails to clear old MXIDs from a table, the system will begin to emit warning messages when the database's oldest MXIDs reach forty million transactions from the wraparound point. And, just as in the XID case, if these warnings are ignored, the system will refuse to generate new MXIDs once there are fewer than three million left until wraparound.
Normal operation when MXIDs are exhausted can be restored in much the same way as when XIDs are exhausted. Follow the same steps in the previous section, but with the following differences:
Information about the multixact IDs held by running transactions is not directly visible in
pg_stat_activity; however, looking for old XIDs is still a good
way of determining which transactions are causing MXID wraparound problems.
PostgreSQL has an optional but highly
recommended feature called autovacuum,
whose purpose is to automate the execution of
VACUUM
and ANALYZE
commands.
When enabled, autovacuum checks for
tables that have had a large number of inserted, updated or deleted
tuples. These checks use the statistics collection facility;
therefore, autovacuum cannot be used unless track_counts is set to true
.
In the default configuration, autovacuuming is enabled and the related
configuration parameters are appropriately set.
The “autovacuum daemon” actually consists of multiple processes.
There is a persistent daemon process, called the
autovacuum launcher, which is in charge of starting
autovacuum worker processes for all databases. The
launcher will distribute the work across time, attempting to start one
worker within each database every autovacuum_naptime
seconds. (Therefore, if the installation has N
databases,
a new worker will be launched every
autovacuum_naptime
/N
seconds.)
A maximum of autovacuum_max_workers worker processes
are allowed to run at the same time. If there are more than
autovacuum_max_workers
databases to be processed,
the next database will be processed as soon as the first worker finishes.
Each worker process will check each table within its database and
execute VACUUM
and/or ANALYZE
as needed.
log_autovacuum_min_duration can be set to monitor
autovacuum workers' activity.
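For example, the following postgresql.conf setting (the value is only an illustration) logs every autovacuum action that runs for at least 250 milliseconds:
log_autovacuum_min_duration = 250   # default unit is milliseconds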
If several large tables all become eligible for vacuuming in a short amount of time, all autovacuum workers might become occupied with vacuuming those tables for a long period. This would result in other tables and databases not being vacuumed until a worker becomes available. There is no limit on how many workers might be in a single database, but workers do try to avoid repeating work that has already been done by other workers. Note that the number of running workers does not count towards max_connections or superuser_reserved_connections limits.
Tables whose relfrozenxid
value is more than
autovacuum_freeze_max_age transactions old are always
vacuumed (this also applies to those tables whose freeze max age has
been modified via storage parameters; see below). Otherwise, if the
number of tuples obsoleted since the last
VACUUM
exceeds the “vacuum threshold”, the
table is vacuumed. The vacuum threshold is defined as:
vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
where the vacuum base threshold is
autovacuum_vacuum_threshold,
the vacuum scale factor is
autovacuum_vacuum_scale_factor,
and the number of tuples is
pg_class
.reltuples
.
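For example, with the default autovacuum_vacuum_threshold of 50 and autovacuum_vacuum_scale_factor of 0.2, a table containing 1,000,000 rows becomes eligible for vacuuming once roughly 50 + 0.2 * 1,000,000 = 200,050 tuples have been updated or deleted since the last vacuum.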
The table is also vacuumed if the number of tuples inserted since the last vacuum has exceeded the defined insert threshold, which is defined as:
vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
where the vacuum insert base threshold is
autovacuum_vacuum_insert_threshold,
and vacuum insert scale factor is
autovacuum_vacuum_insert_scale_factor.
Such vacuums may allow portions of the table to be marked as
all visible and also allow tuples to be frozen, which
can reduce the work required in subsequent vacuums.
For tables which receive INSERT
operations but no or
almost no UPDATE
/DELETE
operations,
it may be beneficial to lower the table's
autovacuum_freeze_min_age as this may allow
tuples to be frozen by earlier vacuums. The number of obsolete tuples and
the number of inserted tuples are obtained from the statistics collector;
it is a semi-accurate count updated by each UPDATE
,
DELETE
and INSERT
operation. (It is
only semi-accurate because some information might be lost under heavy
load.) If the relfrozenxid
value of the table
is more than vacuum_freeze_table_age
transactions old,
an aggressive vacuum is performed to freeze old tuples and advance
relfrozenxid
; otherwise, only pages that have been modified
since the last vacuum are scanned.
For analyze, a similar condition is used: the threshold, defined as:
analyze threshold = analyze base threshold + analyze scale factor * number of tuples
is compared to the total number of tuples inserted, updated, or deleted
since the last ANALYZE
.
Partitioned tables do not directly store tuples and consequently
are not processed by autovacuum. (Autovacuum does process table
partitions just like other tables.) Unfortunately, this means that
autovacuum does not run ANALYZE
on partitioned
tables, and this can cause suboptimal plans for queries that reference
partitioned table statistics. You can work around this problem by
manually running ANALYZE
on partitioned tables
when they are first populated, and again whenever the distribution
of data in their partitions changes significantly.
Temporary tables cannot be accessed by autovacuum. Therefore, appropriate vacuum and analyze operations should be performed via session SQL commands.
The default thresholds and scale factors are taken from
postgresql.conf
, but it is possible to override them
(and many other autovacuum control parameters) on a per-table basis; see
Storage Parameters for more information.
If a setting has been changed via a table's storage parameters, that value
is used when processing that table; otherwise the global settings are
used. See Section 20.10 for more details on
the global settings.
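For example (the table name and values are hypothetical), a large, frequently-updated table could be given more aggressive settings than the global defaults:
ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.05, autovacuum_vacuum_threshold = 1000);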
When multiple workers are running, the autovacuum cost delay parameters
(see Section 20.4.4) are
“balanced” among all the running workers, so that the
total I/O impact on the system is the same regardless of the number
of workers actually running. However, any workers processing tables whose
per-table autovacuum_vacuum_cost_delay
or
autovacuum_vacuum_cost_limit
storage parameters have been set
are not considered in the balancing algorithm.
Autovacuum workers generally don't block other commands. If a process
attempts to acquire a lock that conflicts with the
SHARE UPDATE EXCLUSIVE
lock held by autovacuum, lock
acquisition will interrupt the autovacuum. For conflicting lock modes,
see Table 13.2. However, if the autovacuum
is running to prevent transaction ID wraparound (i.e., the autovacuum query
name in the pg_stat_activity
view ends with
(to prevent wraparound)
), the autovacuum is not
automatically interrupted.
Regularly running commands that acquire locks conflicting with a
SHARE UPDATE EXCLUSIVE
lock (e.g., ANALYZE) can
effectively prevent autovacuums from ever completing.
In some situations it is worthwhile to rebuild indexes periodically with the REINDEX command or a series of individual rebuilding steps.
B-tree index pages that have become completely empty are reclaimed for re-use. However, there is still a possibility of inefficient use of space: if all but a few index keys on a page have been deleted, the page remains allocated. Therefore, a usage pattern in which most, but not all, keys in each range are eventually deleted will see poor use of space. For such usage patterns, periodic reindexing is recommended.
The potential for bloat in non-B-tree indexes has not been well researched. It is a good idea to periodically monitor the index's physical size when using any non-B-tree index type.
Also, for B-tree indexes, a freshly-constructed index is slightly faster to access than one that has been updated many times because logically adjacent pages are usually also physically adjacent in a newly built index. (This consideration does not apply to non-B-tree indexes.) It might be worthwhile to reindex periodically just to improve access speed.
REINDEX can be used safely and easily in all cases.
This command requires an ACCESS EXCLUSIVE
lock by
default, hence it is often preferable to execute it with its
CONCURRENTLY
option, which requires only a
SHARE UPDATE EXCLUSIVE
lock.
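For example (the index and table names are hypothetical):
REINDEX INDEX CONCURRENTLY my_index;
REINDEX TABLE CONCURRENTLY my_table;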
It is a good idea to save the database server's log output
somewhere, rather than just discarding it via /dev/null
.
The log output is invaluable when diagnosing
problems.
The server log can contain sensitive information and needs to be protected,
no matter how or where it is stored, or the destination to which it is routed.
For example, some DDL statements might contain plaintext passwords or other
authentication details. Logged statements at the ERROR
level might show the SQL source code for applications
and might also contain some parts of data rows. Recording data, events and
related information is the intended function of this facility, so this is
not a leakage or a bug. Please ensure the server logs are visible only to
appropriately authorized people.
Log output tends to be voluminous (especially at higher debug levels) so you won't want to save it indefinitely. You need to rotate the log files so that new log files are started and old ones removed after a reasonable period of time.
If you simply direct the stderr of
postgres
into a
file, you will have log output, but
the only way to truncate the log file is to stop and restart
the server. This might be acceptable if you are using
PostgreSQL in a development environment,
but few production servers would find this behavior acceptable.
A better approach is to send the server's
stderr output to some type of log rotation program.
There is a built-in log rotation facility, which you can use by
setting the configuration parameter logging_collector
to
true
in postgresql.conf
. The control
parameters for this program are described in Section 20.8.1. You can also use this approach
to capture the log data in machine readable CSV
(comma-separated values) format.
Alternatively, you might prefer to use an external log rotation
program if you have one that you are already using with other
server software. For example, the rotatelogs
tool included in the Apache distribution
can be used with PostgreSQL. One way to
do this is to pipe the server's
stderr output to the desired program.
If you start the server with
pg_ctl
, then stderr
is already redirected to stdout, so you just need a
pipe command, for example:
pg_ctl start | rotatelogs /var/log/pgsql_log 86400
You can combine these approaches by setting up logrotate
to collect log files produced by PostgreSQL built-in
logging collector. In this case, the logging collector defines the names and
location of the log files, while logrotate
periodically archives these files. When initiating log rotation,
logrotate must ensure that the application
sends further output to the new file. This is commonly done with a
postrotate
script that sends a SIGHUP
signal to the application, which then reopens the log file.
In PostgreSQL, you can run pg_ctl
with the logrotate
option instead. When the server receives
this command, the server either switches to a new log file or reopens the
existing file, depending on the logging configuration
(see Section 20.8.1).
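For example, the postrotate script could simply run the following (the data directory path shown is hypothetical):
pg_ctl logrotate -D /usr/local/pgsql/data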
When using static log file names, the server might fail to reopen the log
file if the max open file limit is reached or a file table overflow occurs.
In this case, log messages are sent to the old log file until a
successful log rotation. If logrotate is
configured to compress the log file and delete it, the server may lose
the messages logged in this time frame. To avoid this issue, you can
configure the logging collector to dynamically assign log file names
and use a prerotate
script to ignore open log files.
Another production-grade approach to managing log output is to
send it to syslog and let
syslog deal with file rotation. To do this, set the
configuration parameter log_destination
to syslog
(to log to syslog only) in
postgresql.conf
. Then you can send a SIGHUP
signal to the syslog daemon whenever you want to force it
to start writing a new log file. If you want to automate log
rotation, the logrotate program can be
configured to work with log files from
syslog.
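A minimal postgresql.conf excerpt for this setup might look like the following; the facility and identifier shown are simply the defaults:
log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'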
On many systems, however, syslog is not very reliable,
particularly with large log messages; it might truncate or drop messages
just when you need them the most. Also, on Linux,
syslog will flush each message to disk, yielding poor
performance. (You can use a “-” at the start of the file name
in the syslog configuration file to disable syncing.)
Note that all the solutions described above take care of starting new log files at configurable intervals, but they do not handle deletion of old, no-longer-useful log files. You will probably want to set up a batch job to periodically delete old log files. Another possibility is to configure the rotation program so that old log files are overwritten cyclically.
pgBadger is an external project that does sophisticated log file analysis. check_postgres provides Nagios alerts when important messages appear in the log files, as well as detection of many other extraordinary conditions.
Table of Contents
As with everything that contains valuable data, PostgreSQL databases should be backed up regularly. While the procedure is essentially simple, it is important to have a clear understanding of the underlying techniques and assumptions.
There are three fundamentally different approaches to backing up PostgreSQL data:
SQL dump
File system level backup
Continuous archiving
Each has its own strengths and weaknesses; each is discussed in turn in the following sections.
The idea behind this dump method is to generate a file with SQL commands that, when fed back to the server, will recreate the database in the same state as it was at the time of the dump. PostgreSQL provides the utility program pg_dump for this purpose. The basic usage of this command is:
pg_dump dbname > dumpfile
As you see, pg_dump writes its result to the standard output. We will see below how this can be useful. While the above command creates a text file, pg_dump can create files in other formats that allow for parallelism and more fine-grained control of object restoration.
pg_dump is a regular PostgreSQL
client application (albeit a particularly clever one). This means
that you can perform this backup procedure from any remote host that has
access to the database. But remember that pg_dump
does not operate with special permissions. In particular, it must
have read access to all tables that you want to back up, so in order
to back up the entire database you almost always have to run it as a
database superuser. (If you do not have sufficient privileges to back up
the entire database, you can still back up portions of the database to which
you do have access using options such as
-n schema or -t table.)
To specify which database server pg_dump should
contact, use the command line options -h host and -p port. The
default host is the local host or whatever your
PGHOST
environment variable specifies. Similarly,
the default port is indicated by the PGPORT
environment variable or, failing that, by the compiled-in default.
(Conveniently, the server will normally have the same compiled-in
default.)
Like any other PostgreSQL client application,
pg_dump will by default connect with the database
user name that is equal to the current operating system user name. To override
this, either specify the -U
option or set the
environment variable PGUSER
. Remember that
pg_dump connections are subject to the normal
client authentication mechanisms (which are described in Chapter 21).
An important advantage of pg_dump over the other backup methods described later is that pg_dump's output can generally be re-loaded into newer versions of PostgreSQL, whereas file-level backups and continuous archiving are both extremely server-version-specific. pg_dump is also the only method that will work when transferring a database to a different machine architecture, such as going from a 32-bit to a 64-bit server.
Dumps created by pg_dump are internally consistent,
meaning, the dump represents a snapshot of the database at the time
pg_dump began running. pg_dump does not
block other operations on the database while it is working.
(Exceptions are those operations that need to operate with an
exclusive lock, such as most forms of ALTER TABLE
.)
Text files created by pg_dump are intended to be read in by the psql program. The general command form to restore a dump is
psql dbname < dumpfile
where dumpfile
is the
file output by the pg_dump command. The database dbname
will not be created by this
command, so you must create it yourself from template0
before executing psql (e.g., with
createdb -T template0
). psql
supports options similar to pg_dump for specifying
the database server to connect to and the user name to use. See
the psql reference page for more information.
Non-text file dumps are restored using the pg_restore utility.
Before restoring an SQL dump, all the users who own objects or were granted permissions on objects in the dumped database must already exist. If they do not, the restore will fail to recreate the objects with the original ownership and/or permissions. (Sometimes this is what you want, but usually it is not.)
By default, the psql script will continue to
execute after an SQL error is encountered. You might wish to run
psql with
the ON_ERROR_STOP
variable set to alter that
behavior and have psql exit with an
exit status of 3 if an SQL error occurs:
psql --set ON_ERROR_STOP=on dbname < dumpfile
Either way, you will only have a partially restored database.
Alternatively, you can specify that the whole dump should be
restored as a single transaction, so the restore is either fully
completed or fully rolled back. This mode can be specified by
passing the -1
or --single-transaction
command-line options to psql. When using this
mode, be aware that even a minor error can roll back a
restore that has already run for many hours. However, that might
still be preferable to manually cleaning up a complex database
after a partially restored dump.
The ability of pg_dump and psql to write to or read from pipes makes it possible to dump a database directly from one server to another, for example:
pg_dump -h host1 dbname | psql -h host2 dbname
The dumps produced by pg_dump are relative to
template0
. This means that any languages, procedures,
etc. added via template1
will also be dumped by
pg_dump. As a result, when restoring, if you are
using a customized template1
, you must create the
empty database from template0
, as in the example
above.
After restoring a backup, it is wise to run ANALYZE
on each
database so the query optimizer has useful statistics;
see Section 25.1.3
and Section 25.1.6 for more information.
For more advice on how to load large amounts of data
into PostgreSQL efficiently, refer to Section 14.4.
pg_dump dumps only a single database at a time, and it does not dump information about roles or tablespaces (because those are cluster-wide rather than per-database). To support convenient dumping of the entire contents of a database cluster, the pg_dumpall program is provided. pg_dumpall backs up each database in a given cluster, and also preserves cluster-wide data such as role and tablespace definitions. The basic usage of this command is:
pg_dumpall > dumpfile
The resulting dump can be restored with psql:
psql -f dumpfile postgres
(Actually, you can specify any existing database name to start from,
but if you are loading into an empty cluster then postgres
should usually be used.) It is always necessary to have
database superuser access when restoring a pg_dumpall
dump, as that is required to restore the role and tablespace information.
If you use tablespaces, make sure that the tablespace paths in the
dump are appropriate for the new installation.
pg_dumpall works by emitting commands to re-create roles, tablespaces, and empty databases, then invoking pg_dump for each database. This means that while each database will be internally consistent, the snapshots of different databases are not synchronized.
Cluster-wide data can be dumped alone using the
pg_dumpall --globals-only
option.
This is necessary to fully back up the cluster if running the
pg_dump command on individual databases.
Some operating systems have maximum file size limits that cause problems when creating large pg_dump output files. Fortunately, pg_dump can write to the standard output, so you can use standard Unix tools to work around this potential problem. There are several possible methods:
Use compressed dumps. You can use your favorite compression program, for example gzip:
pg_dump dbname | gzip > filename.gz
Reload with:
gunzip -c filename.gz | psql dbname
or:
cat filename.gz | gunzip | psql dbname
Use split
.
The split
command
allows you to split the output into smaller files that are
acceptable in size to the underlying file system. For example, to
make 2 gigabyte chunks:
pg_dump dbname | split -b 2G - filename
Reload with:
cat filename* | psql dbname
If using GNU split, it is possible to use it and gzip together:
pg_dump dbname | split -b 2G --filter='gzip > $FILE.gz'
It can be restored using zcat
.
Use pg_dump's custom dump format.
If PostgreSQL was built on a system with the
zlib compression library installed, the custom dump
format will compress data as it writes it to the output file. This will
produce dump file sizes similar to using gzip
, but it
has the added advantage that tables can be restored selectively. The
following command dumps a database using the custom dump format:
pg_dump -Fc dbname > filename
A custom-format dump is not a script for psql, but instead must be restored with pg_restore, for example:
pg_restore -d dbname filename
See the pg_dump and pg_restore reference pages for details.
For very large databases, you might need to combine split
with one of the other two approaches.
Use pg_dump's parallel dump feature.
To speed up the dump of a large database, you can use
pg_dump's parallel mode. This will dump
multiple tables at the same time. You can control the degree of
parallelism with the -j
parameter. Parallel dumps
are only supported for the "directory" archive format.
pg_dump -j num -F d -f out.dir dbname
You can use pg_restore -j
to restore a dump in parallel.
This will work for any archive of either the "custom" or the "directory"
archive mode, whether or not it has been created with pg_dump -j
.
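For example, the directory-format dump created above could be restored with four parallel jobs:
pg_restore -j 4 -d dbname out.dir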
An alternative backup strategy is to directly copy the files that PostgreSQL uses to store the data in the database; Section 19.2 explains where these files are located. You can use whatever method you prefer for doing file system backups; for example:
tar -cf backup.tar /usr/local/pgsql/data
There are two restrictions, however, which make this method impractical, or at least inferior to the pg_dump method:
The database server must be shut down in order to
get a usable backup. Half-way measures such as disallowing all
connections will not work
(in part because tar
and similar tools do not take
an atomic snapshot of the state of the file system,
but also because of internal buffering within the server).
Information about stopping the server can be found in
Section 19.5. Needless to say, you
also need to shut down the server before restoring the data.
If you have dug into the details of the file system layout of the
database, you might be tempted to try to back up or restore only certain
individual tables or databases from their respective files or
directories. This will not work because the
information contained in these files is not usable without
the commit log files,
pg_xact/*
, which contain the commit status of
all transactions. A table file is only usable with this
information. Of course it is also impossible to restore only a
table and the associated pg_xact
data
because that would render all other tables in the database
cluster useless. So file system backups only work for complete
backup and restoration of an entire database cluster.
An alternative file-system backup approach is to make a
“consistent snapshot” of the data directory, if the
file system supports that functionality (and you are willing to
trust that it is implemented correctly). The typical procedure is
to make a “frozen snapshot” of the volume containing the
database, then copy the whole data directory (not just parts, see
above) from the snapshot to a backup device, then release the frozen
snapshot. This will work even while the database server is running.
However, a backup created in this way saves
the database files in a state as if the database server was not
properly shut down; therefore, when you start the database server
on the backed-up data, it will think the previous server instance
crashed and will replay the WAL log. This is not a problem; just
be aware of it (and be sure to include the WAL files in your backup).
You can perform a CHECKPOINT
before taking the
snapshot to reduce recovery time.
If your database is spread across multiple file systems, there might not be any way to obtain exactly-simultaneous frozen snapshots of all the volumes. For example, if your data files and WAL log are on different disks, or if tablespaces are on different file systems, it might not be possible to use snapshot backup because the snapshots must be simultaneous. Read your file system documentation very carefully before trusting the consistent-snapshot technique in such situations.
If simultaneous snapshots are not possible, one option is to shut down the database server long enough to establish all the frozen snapshots. Another option is to perform a continuous archiving base backup (Section 26.3.2) because such backups are immune to file system changes during the backup. This requires enabling continuous archiving just during the backup process; restore is done using continuous archive recovery (Section 26.3.4).
Another option is to use rsync to perform a file
system backup. This is done by first running rsync
while the database server is running, then shutting down the database
server long enough to do an rsync --checksum
.
(--checksum
is necessary because rsync
only
has file modification-time granularity of one second.) The
second rsync will be quicker than the first,
because it has relatively little data to transfer, and the end result
will be consistent because the server was down. This method
allows a file system backup to be performed with minimal downtime.
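A sketch of this procedure, assuming the data directory and backup location shown here, might look like:
rsync -a /usr/local/pgsql/data/ /backup/pgsql/data/              # first pass, server still running
pg_ctl stop -D /usr/local/pgsql/data
rsync -a --checksum /usr/local/pgsql/data/ /backup/pgsql/data/   # second pass, server stopped
pg_ctl start -D /usr/local/pgsql/data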
Note that a file system backup will typically be larger than an SQL dump. (pg_dump does not need to dump the contents of indexes for example, just the commands to recreate them.) However, taking a file system backup might be faster.
At all times, PostgreSQL maintains a
write ahead log (WAL) in the pg_wal/
subdirectory of the cluster's data directory. The log records
every change made to the database's data files. This log exists
primarily for crash-safety purposes: if the system crashes, the
database can be restored to consistency by “replaying” the
log entries made since the last checkpoint. However, the existence
of the log makes it possible to use a third strategy for backing up
databases: we can combine a file-system-level backup with backup of
the WAL files. If recovery is needed, we restore the file system backup and
then replay from the backed-up WAL files to bring the system to a
current state. This approach is more complex to administer than
either of the previous approaches, but it has some significant
benefits:
We do not need a perfectly consistent file system backup as the starting point. Any internal inconsistency in the backup will be corrected by log replay (this is not significantly different from what happens during crash recovery). So we do not need a file system snapshot capability, just tar or a similar archiving tool.
Since we can combine an indefinitely long sequence of WAL files for replay, continuous backup can be achieved simply by continuing to archive the WAL files. This is particularly valuable for large databases, where it might not be convenient to take a full backup frequently.
It is not necessary to replay the WAL entries all the way to the end. We could stop the replay at any point and have a consistent snapshot of the database as it was at that time. Thus, this technique supports point-in-time recovery: it is possible to restore the database to its state at any time since your base backup was taken.
If we continuously feed the series of WAL files to another machine that has been loaded with the same base backup file, we have a warm standby system: at any point we can bring up the second machine and it will have a nearly-current copy of the database.
pg_dump and pg_dumpall do not produce file-system-level backups and cannot be used as part of a continuous-archiving solution. Such dumps are logical and do not contain enough information to be used by WAL replay.
As with the plain file-system-backup technique, this method can only support restoration of an entire database cluster, not a subset. Also, it requires a lot of archival storage: the base backup might be bulky, and a busy system will generate many megabytes of WAL traffic that have to be archived. Still, it is the preferred backup technique in many situations where high reliability is needed.
To recover successfully using continuous archiving (also called “online backup” by many database vendors), you need a continuous sequence of archived WAL files that extends back at least as far as the start time of your backup. So to get started, you should set up and test your procedure for archiving WAL files before you take your first base backup. Accordingly, we first discuss the mechanics of archiving WAL files.
In an abstract sense, a running PostgreSQL system produces an indefinitely long sequence of WAL records. The system physically divides this sequence into WAL segment files, which are normally 16MB apiece (although the segment size can be altered during initdb). The segment files are given numeric names that reflect their position in the abstract WAL sequence. When not using WAL archiving, the system normally creates just a few segment files and then “recycles” them by renaming no-longer-needed segment files to higher segment numbers. It's assumed that segment files whose contents precede the last checkpoint are no longer of interest and can be recycled.
When archiving WAL data, we need to capture the contents of each segment
file once it is filled, and save that data somewhere before the segment
file is recycled for reuse. Depending on the application and the
available hardware, there could be many different ways of “saving
the data somewhere”: we could copy the segment files to an NFS-mounted
directory on another machine, write them onto a tape drive (ensuring that
you have a way of identifying the original name of each file), or batch
them together and burn them onto CDs, or something else entirely. To
provide the database administrator with flexibility,
PostgreSQL tries not to make any assumptions about how
the archiving will be done. Instead, PostgreSQL lets
the administrator specify a shell command to be executed to copy a
completed segment file to wherever it needs to go. The command could be
as simple as a cp
, or it could invoke a complex shell
script — it's all up to you.
To enable WAL archiving, set the wal_level
configuration parameter to replica
or higher,
archive_mode to on
,
and specify the shell command to use in the archive_command configuration parameter. In practice
these settings will always be placed in the
postgresql.conf
file.
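For example, a minimal postgresql.conf excerpt (using the example archive directory discussed below) could be:
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'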
In archive_command
,
%p
is replaced by the path name of the file to
archive, while %f
is replaced by only the file name.
(The path name is relative to the current working directory,
i.e., the cluster's data directory.)
Use %%
if you need to embed an actual %
character in the command. The simplest useful command is something
like:
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'  # Unix
archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"'  # Windows
which will copy archivable WAL segments to the directory
/mnt/server/archivedir
. (This is an example, not a
recommendation, and might not work on all platforms.) After the
%p
and %f
parameters have been replaced,
the actual command executed might look like this:
test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/00000001000000A900000065 /mnt/server/archivedir/00000001000000A900000065
A similar command will be generated for each new file to be archived.
The archive command will be executed under the ownership of the same user that the PostgreSQL server is running as. Since the series of WAL files being archived contains effectively everything in your database, you will want to be sure that the archived data is protected from prying eyes; for example, archive into a directory that does not have group or world read access.
It is important that the archive command return zero exit status if and only if it succeeds. Upon getting a zero result, PostgreSQL will assume that the file has been successfully archived, and will remove or recycle it. However, a nonzero status tells PostgreSQL that the file was not archived; it will try again periodically until it succeeds.
When the archive command is terminated by a signal (other than SIGTERM that is used as part of a server shutdown) or an error by the shell with an exit status greater than 125 (such as command not found), the archiver process aborts and gets restarted by the postmaster. In such cases, the failure is not reported in pg_stat_archiver.
The archive command should generally be designed to refuse to overwrite any pre-existing archive file. This is an important safety feature to preserve the integrity of your archive in case of administrator error (such as sending the output of two different servers to the same archive directory).
It is advisable to test your proposed archive command to ensure that it
indeed does not overwrite an existing file, and that it returns
nonzero status in this case.
The example command above for Unix ensures this by including a separate
test
step. On some Unix platforms, cp
has
switches such as -i
that can be used to do the same thing
less verbosely, but you should not rely on these without verifying that
the right exit status is returned. (In particular, GNU cp
will return status zero when -i
is used and the target file
already exists, which is not the desired behavior.)
While designing your archiving setup, consider what will happen if
the archive command fails repeatedly because some aspect requires
operator intervention or the archive runs out of space. For example, this
could occur if you write to tape without an autochanger; when the tape
fills, nothing further can be archived until the tape is swapped.
You should ensure that any error condition or request to a human operator
is reported appropriately so that the situation can be
resolved reasonably quickly. The pg_wal/
directory will
continue to fill with WAL segment files until the situation is resolved.
(If the file system containing pg_wal/
fills up,
PostgreSQL will do a PANIC shutdown. No committed
transactions will be lost, but the database will remain offline until
you free some space.)
The speed of the archiving command is unimportant as long as it can keep up
with the average rate at which your server generates WAL data. Normal
operation continues even if the archiving process falls a little behind.
If archiving falls significantly behind, this will increase the amount of
data that would be lost in the event of a disaster. It will also mean that
the pg_wal/
directory will contain large numbers of
not-yet-archived segment files, which could eventually exceed available
disk space. You are advised to monitor the archiving process to ensure that
it is working as you intend.
In writing your archive command, you should assume that the file names to
be archived can be up to 64 characters long and can contain any
combination of ASCII letters, digits, and dots. It is not necessary to
preserve the original relative path (%p
) but it is necessary to
preserve the file name (%f
).
Note that although WAL archiving will allow you to restore any
modifications made to the data in your PostgreSQL database,
it will not restore changes made to configuration files (that is,
postgresql.conf
, pg_hba.conf
and
pg_ident.conf
), since those are edited manually rather
than through SQL operations.
You might wish to keep the configuration files in a location that will
be backed up by your regular file system backup procedures. See
Section 20.2 for how to relocate the
configuration files.
The archive command is only invoked on completed WAL segments. Hence,
if your server generates little WAL traffic (or has slack periods
where it does so), there could be a long delay between the completion
of a transaction and its safe recording in archive storage. To put
a limit on how old unarchived data can be, you can set
archive_timeout to force the server to switch
to a new WAL segment file at least that often. Note that archived
files that are closed early due to a forced switch are still the same
length as completely full files. It is therefore unwise to set a very
short archive_timeout
— it will bloat your archive
storage. archive_timeout
settings of a minute or so are
usually reasonable.
Also, you can force a segment switch manually with
pg_switch_wal
if you want to ensure that a
just-finished transaction is archived as soon as possible. Other utility
functions related to WAL management are listed in Table 9.87.
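For example:
SELECT pg_switch_wal();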
When wal_level
is minimal
some SQL commands
are optimized to avoid WAL logging, as described in Section 14.4.7. If archiving or streaming replication were
turned on during execution of one of these statements, WAL would not
contain enough information for archive recovery. (Crash recovery is
unaffected.) For this reason, wal_level
can only be changed at
server start. However, archive_command
can be changed with a
configuration file reload. If you wish to temporarily stop archiving,
one way to do it is to set archive_command
to the empty
string (''
).
This will cause WAL files to accumulate in pg_wal/
until a
working archive_command
is re-established.
The easiest way to perform a base backup is to use the pg_basebackup tool. It can create a base backup either as regular files or as a tar archive. If more flexibility than pg_basebackup can provide is required, you can also make a base backup using the low level API (see Section 26.3.3).
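For example, the following command (an illustration only) writes a compressed tar-format base backup of the local server to /backup/base and reports progress while it runs:
pg_basebackup -D /backup/base -Ft -z -P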
It is not necessary to be concerned about the amount of time it takes
to make a base backup. However, if you normally run the
server with full_page_writes
disabled, you might notice a drop
in performance while the backup runs since full_page_writes
is
effectively forced on during backup mode.
To make use of the backup, you will need to keep all the WAL segment files generated during and after the file system backup. To aid you in doing this, the base backup process creates a backup history file that is immediately stored into the WAL archive area. This file is named after the first WAL segment file that you need for the file system backup. For example, if the starting WAL file is 0000000100001234000055CD, the backup history file will be named something like 0000000100001234000055CD.007C9330.backup. (The second part of the file name stands for an exact position within the WAL file, and can ordinarily be ignored.) Once you have safely archived the file system backup and the WAL segment files used during the backup (as specified in the backup history file), all archived WAL segments with names numerically less are no longer needed to recover the file system backup and can be deleted. However, you should consider keeping several backup sets to be absolutely certain that you can recover your data.
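For example, assuming the archive lives in /mnt/server/archivedir, segments older than a given backup could be pruned with the pg_archivecleanup utility, using the backup history file as the cut-off point:
pg_archivecleanup /mnt/server/archivedir 0000000100001234000055CD.007C9330.backup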
The backup history file is just a small text file. It contains the label string you gave to pg_basebackup, as well as the starting and ending times and WAL segments of the backup. If you used the label to identify the associated dump file, then the archived history file is enough to tell you which dump file to restore.
Since you have to keep around all the archived WAL files back to your last base backup, the interval between base backups should usually be chosen based on how much storage you want to expend on archived WAL files. You should also consider how long you are prepared to spend recovering, if recovery should be necessary — the system will have to replay all those WAL segments, and that could take a while if it has been a long time since the last base backup.
The procedure for making a base backup using the low level APIs contains a few more steps than the pg_basebackup method, but is relatively simple. It is very important that these steps are executed in sequence, and that the success of a step is verified before proceeding to the next step.
Low level base backups can be made in a non-exclusive or an exclusive way. The non-exclusive method is recommended and the exclusive one is deprecated and will eventually be removed.
A non-exclusive low level backup is one that allows other concurrent backups to be running (both those started using the same backup API and those started using pg_basebackup).
Ensure that WAL archiving is enabled and working.
Connect to the server (it does not matter which database) as a user with rights to run pg_start_backup (superuser, or a user who has been granted EXECUTE on the function) and issue the command:
SELECT pg_start_backup('label', false, false);
where label is any string you want to use to uniquely identify this backup operation. The connection calling pg_start_backup must be maintained until the end of the backup, or the backup will be automatically aborted.
By default, pg_start_backup can take a long time to finish. This is because it performs a checkpoint, and the I/O required for the checkpoint will be spread out over a significant period of time, by default half your inter-checkpoint interval (see the configuration parameter checkpoint_completion_target). This is usually what you want, because it minimizes the impact on query processing. If you want to start the backup as soon as possible, change the second parameter to true, which will issue an immediate checkpoint using as much I/O as available. The third parameter being false tells pg_start_backup to initiate a non-exclusive base backup.
Perform the backup, using any convenient file-system-backup tool such as tar or cpio (not pg_dump or pg_dumpall). It is neither necessary nor desirable to stop normal operation of the database while you do this. See Section 26.3.3.3 for things to consider during this backup.
In the same connection as before, issue the command:
SELECT * FROM pg_stop_backup(false, true);
This terminates backup mode. On a primary, it also performs an automatic
switch to the next WAL segment. On a standby, it is not possible to
automatically switch WAL segments, so you may wish to run
pg_switch_wal
on the primary to perform a manual
switch. The reason for the switch is to arrange for
the last WAL segment file written during the backup interval to be
ready to archive.
pg_stop_backup will return one row with three values. The second of these fields should be written to a file named backup_label in the root directory of the backup. The third field should be written to a file named tablespace_map unless the field is empty. These files are vital to the backup working and must be written byte for byte without modification, which may require opening the file in binary mode.
Once the WAL segment files active during the backup are archived, you are done. The file identified by pg_stop_backup's first return value is the last segment that is required to form a complete set of backup files. On a primary, if archive_mode is enabled and the wait_for_archive parameter is true, pg_stop_backup does not return until the last segment has been archived. On a standby, archive_mode must be set to always in order for pg_stop_backup to wait.
Archiving of these files happens automatically since you have
already configured archive_command
. In most cases this
happens quickly, but you are advised to monitor your archive
system to ensure there are no delays.
If the archive process has fallen behind
because of failures of the archive command, it will keep retrying
until the archive succeeds and the backup is complete.
If you wish to place a time limit on the execution of pg_stop_backup, set an appropriate statement_timeout value, but make note that if pg_stop_backup terminates because of this your backup may not be valid.
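For example, to abandon the wait after ten minutes you might run the following in the backup session, keeping in mind that a timed-out pg_stop_backup leaves the backup suspect:
SET statement_timeout = '10min';
SELECT * FROM pg_stop_backup(false, true);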
If the backup process monitors and ensures that all WAL segment files
required for the backup are successfully archived then the
wait_for_archive
parameter (which defaults to true) can be set
to false to have
pg_stop_backup
return as soon as the stop backup record is
written to the WAL. By default, pg_stop_backup
will wait
until all WAL has been archived, which can take some time. This option
must be used with caution: if WAL archiving is not monitored correctly
then the backup might not include all of the WAL files and will
therefore be incomplete and not able to be restored.
The exclusive backup method is deprecated and should be avoided. Prior to PostgreSQL 9.6, this was the only low-level method available, but it is now recommended that all users upgrade their scripts to use non-exclusive backups.
The process for an exclusive backup is mostly the same as for a non-exclusive one, but it differs in a few key steps. This type of backup can only be taken on a primary and does not allow concurrent backups. Moreover, because it creates a backup label file, as described below, it can block automatic restart of the primary server after a crash. On the other hand, the erroneous removal of this file from a backup or standby is a common mistake, which can result in serious data corruption. If it is necessary to use this method, the following steps may be used.
Ensure that WAL archiving is enabled and working.
Connect to the server (it does not matter which database) as a user with rights to run pg_start_backup (superuser, or a user who has been granted EXECUTE on the function) and issue the command:
SELECT pg_start_backup('label');
where label is any string you want to use to uniquely identify this backup operation. pg_start_backup creates a backup label file, called backup_label, in the cluster directory with information about your backup, including the start time and label string. The function also creates a tablespace map file, called tablespace_map, in the cluster directory with information about tablespace symbolic links in pg_tblspc/ if one or more such links are present. Both files are critical to the integrity of the backup, should you need to restore from it.
By default, pg_start_backup
can take a long time to finish.
This is because it performs a checkpoint, and the I/O
required for the checkpoint will be spread out over a significant
period of time, by default half your inter-checkpoint interval
(see the configuration parameter
checkpoint_completion_target). This is
usually what you want, because it minimizes the impact on query
processing. If you want to start the backup as soon as
possible, use:
SELECT pg_start_backup('label', true);
This forces the checkpoint to be done as quickly as possible.
Perform the backup, using any convenient file-system-backup tool such as tar or cpio (not pg_dump or pg_dumpall). It is neither necessary nor desirable to stop normal operation of the database while you do this. See Section 26.3.3.3 for things to consider during this backup.
As noted above, if the server crashes during the backup it may not be
possible to restart until the backup_label
file has
been manually deleted from the PGDATA
directory. Note
that it is very important to never remove the
backup_label
file when restoring a backup, because
this will result in corruption. Confusion about when it is appropriate
to remove this file is a common cause of data corruption when using this
method; be very certain that you remove the file only on an existing
primary and never when building a standby or restoring a backup, even if
you are building a standby that will subsequently be promoted to a new
primary.
Again connect to the database as a user with rights to run pg_stop_backup (superuser, or a user who has been granted EXECUTE on the function), and issue the command:
SELECT pg_stop_backup();
This function terminates backup mode and performs an automatic switch to the next WAL segment. The reason for the switch is to arrange for the last WAL segment written during the backup interval to be ready to archive.
Once the WAL segment files active during the backup are archived, you are done. The file identified by pg_stop_backup's result is the last segment that is required to form a complete set of backup files. If archive_mode is enabled, pg_stop_backup does not return until the last segment has been archived.
Archiving of these files happens automatically since you have
already configured archive_command
. In most cases this
happens quickly, but you are advised to monitor your archive
system to ensure there are no delays.
If the archive process has fallen behind
because of failures of the archive command, it will keep retrying
until the archive succeeds and the backup is complete.
When using exclusive backup mode, it is absolutely imperative to ensure
that pg_stop_backup
completes successfully at the
end of the backup. Even if the backup itself fails, for example due to
lack of disk space, failure to call pg_stop_backup
will leave the server in backup mode indefinitely, causing future backups
to fail and increasing the risk of a restart failure during the time that
backup_label
exists.
Some file system backup tools emit warnings or errors
if the files they are trying to copy change while the copy proceeds.
When taking a base backup of an active database, this situation is normal
and not an error. However, you need to ensure that you can distinguish
complaints of this sort from real errors. For example, some versions
of rsync return a separate exit code for
“vanished source files”, and you can write a driver script to
accept this exit code as a non-error case. Also, some versions of
GNU tar return an error code indistinguishable from
a fatal error if a file was truncated while tar was
copying it. Fortunately, GNU tar versions 1.16 and
later exit with 1 if a file was changed during the backup,
and 2 for other errors. With GNU tar version 1.23 and
later, you can use the warning options --warning=no-file-changed
--warning=no-file-removed
to hide the related warning messages.
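For example, with a sufficiently recent GNU tar the backup might be taken with an invocation along these lines; the paths are illustrative:
tar --warning=no-file-changed --warning=no-file-removed -cf /backups/base.tar /usr/local/pgsql/data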
Be certain that your backup includes all of the files under
the database cluster directory (e.g., /usr/local/pgsql/data
).
If you are using tablespaces that do not reside underneath this directory,
be careful to include them as well (and be sure that your backup
archives symbolic links as links, otherwise the restore will corrupt
your tablespaces).
You should, however, omit from the backup the files within the
cluster's pg_wal/
subdirectory. This
slight adjustment is worthwhile because it reduces the risk
of mistakes when restoring. This is easy to arrange if
pg_wal/
is a symbolic link pointing to someplace outside
the cluster directory, which is a common setup anyway for performance
reasons. You might also want to exclude postmaster.pid
and postmaster.opts
, which record information
about the running postmaster, not about the
postmaster which will eventually use this backup.
(These files can confuse pg_ctl.)
It is often a good idea to also omit from the backup the files
within the cluster's pg_replslot/
directory, so that
replication slots that exist on the primary do not become part of the
backup. Otherwise, the subsequent use of the backup to create a standby
may result in indefinite retention of WAL files on the standby, and
possibly bloat on the primary if hot standby feedback is enabled, because
the clients that are using those replication slots will still be connecting
to and updating the slots on the primary, not the standby. Even if the
backup is only intended for use in creating a new primary, copying the
replication slots isn't expected to be particularly useful, since the
contents of those slots will likely be badly out of date by the time
the new primary comes on line.
The contents of the directories pg_dynshmem/, pg_notify/, pg_serial/, pg_snapshots/, pg_stat_tmp/, and pg_subtrans/ (but not the directories themselves) can be omitted from the backup as they will be initialized on postmaster startup. If stats_temp_directory is set and is under the data directory then the contents of that directory can also be omitted.
Any file or directory beginning with pgsql_tmp
can be
omitted from the backup. These files are removed on postmaster start and
the directories will be recreated as needed.
pg_internal.init
files can be omitted from the
backup whenever a file of that name is found. These files contain
relation cache data that is always rebuilt when recovering.
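Putting these suggestions together, an illustrative (and deliberately incomplete) GNU tar invocation that skips the items discussed above might look like this; adjust the paths and the exclusion list to your own layout:
tar -cf /backups/base.tar \
    --exclude=pg_wal --exclude=pg_replslot \
    --exclude=postmaster.pid --exclude=postmaster.opts \
    --exclude='pgsql_tmp*' --exclude=pg_internal.init \
    /usr/local/pgsql/data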
The backup label file includes the label string you gave to pg_start_backup, as well as the time at which pg_start_backup was run, and the name of the starting WAL file. In case of confusion it is therefore possible to look inside a backup file and determine exactly which backup session the dump file came from. The tablespace map file includes the symbolic link names as they exist in the directory pg_tblspc/ and the full path of each symbolic link. These files are not merely for your information; their presence and contents are critical to the proper operation of the system's recovery process.
It is also possible to make a backup while the server is
stopped. In this case, you obviously cannot use
pg_start_backup
or pg_stop_backup
, and
you will therefore be left to your own devices to keep track of which
backup is which and how far back the associated WAL files go.
It is generally better to follow the continuous archiving procedure above.
Okay, the worst has happened and you need to recover from your backup. Here is the procedure:
Stop the server, if it's running.
If you have the space to do so,
copy the whole cluster data directory and any tablespaces to a temporary
location in case you need them later. Note that this precaution will
require that you have enough free space on your system to hold two
copies of your existing database. If you do not have enough space,
you should at least save the contents of the cluster's pg_wal
subdirectory, as it might contain logs which
were not archived before the system went down.
Remove all existing files and subdirectories under the cluster data directory and under the root directories of any tablespaces you are using.
Restore the database files from your file system backup. Be sure that they
are restored with the right ownership (the database system user, not
root
!) and with the right permissions. If you are using
tablespaces,
you should verify that the symbolic links in pg_tblspc/
were correctly restored.
Remove any files present in pg_wal/
; these came from the
file system backup and are therefore probably obsolete rather than current.
If you didn't archive pg_wal/
at all, then recreate
it with proper permissions,
being careful to ensure that you re-establish it as a symbolic link
if you had it set up that way before.
If you have unarchived WAL segment files that you saved in step 2,
copy them into pg_wal/
. (It is best to copy them,
not move them, so you still have the unmodified files if a
problem occurs and you have to start over.)
Set recovery configuration settings in
postgresql.conf
(see Section 20.5.4) and create a file
recovery.signal
in the cluster
data directory. You might
also want to temporarily modify pg_hba.conf
to prevent
ordinary users from connecting until you are sure the recovery was successful.
Start the server. The server will go into recovery mode and
proceed to read through the archived WAL files it needs. Should the
recovery be terminated because of an external error, the server can
simply be restarted and it will continue recovery. Upon completion
of the recovery process, the server will remove
recovery.signal
(to prevent
accidentally re-entering recovery mode later) and then
commence normal database operations.
Inspect the contents of the database to ensure you have recovered to
the desired state. If not, return to step 1. If all is well,
allow your users to connect by restoring pg_hba.conf
to normal.
The key part of all this is to set up a recovery configuration that describes how you want to recover and how far the recovery should run. The one thing that you absolutely must specify is the restore_command, which tells PostgreSQL how to retrieve archived WAL file segments. Like the archive_command, this is a shell command string. It can contain %f, which is replaced by the name of the desired log file, and %p, which is replaced by the path name to copy the log file to. (The path name is relative to the current working directory, i.e., the cluster's data directory.) Write %% if you need to embed an actual % character in the command. The simplest useful command is something like:
restore_command = 'cp /mnt/server/archivedir/%f %p'
which will copy previously archived WAL segments from the directory /mnt/server/archivedir. Of course, you can use something much more complicated, perhaps even a shell script that requests the operator to mount an appropriate tape.
It is important that the command return nonzero exit status on failure. The command will be called requesting files that are not present in the archive; it must return nonzero when so asked. This is not an error condition. An exception is that if the command was terminated by a signal (other than SIGTERM, which is used as part of a database server shutdown) or an error by the shell (such as command not found), then recovery will abort and the server will not start up.
Not all of the requested files will be WAL segment files; you should also expect requests for files with a suffix of .history. Also be aware that the base name of the %p path will be different from %f; do not expect them to be interchangeable. WAL segments that cannot be found in the archive will be sought in pg_wal/; this allows use of recent un-archived segments. However, segments that are available from the archive will be used in preference to files in pg_wal/.
Normally, recovery will proceed through all available WAL segments, thereby restoring the database to the current point in time (or as close as possible given the available WAL segments). Therefore, a normal recovery will end with a “file not found” message, the exact text of the error message depending upon your choice of restore_command. You may also see an error message at the start of recovery for a file named something like 00000001.history. This is also normal and does not indicate a problem in simple recovery situations; see Section 26.3.5 for discussion.
If you want to recover to some previous point in time (say, right before the junior DBA dropped your main transaction table), just specify the required stopping point. You can specify the stop point, known as the “recovery target”, either by date/time, named restore point or by completion of a specific transaction ID. As of this writing only the date/time and named restore point options are very usable, since there are no tools to help you identify with any accuracy which transaction ID to use.
The stop point must be after the ending time of the base backup, i.e.,
the end time of pg_stop_backup
. You cannot use a base backup
to recover to a time when that backup was in progress. (To
recover to such a time, you must go back to your previous base backup
and roll forward from there.)
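For example, to stop recovery just before the table was dropped in the scenario above, you might set the following in postgresql.conf; the timestamp is purely illustrative:
recovery_target_time = '2024-05-14 17:14:00'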
If recovery finds corrupted WAL data, recovery will
halt at that point and the server will not start. In such a case the
recovery process could be re-run from the beginning, specifying a
“recovery target” before the point of corruption so that recovery
can complete normally.
If recovery fails for an external reason, such as a system crash or
if the WAL archive has become inaccessible, then the recovery can simply
be restarted and it will restart almost from where it failed.
Recovery restart works much like checkpointing in normal operation:
the server periodically forces all its state to disk, and then updates
the pg_control
file to indicate that the already-processed
WAL data need not be scanned again.
The ability to restore the database to a previous point in time creates some complexities that are akin to science-fiction stories about time travel and parallel universes. For example, in the original history of the database, suppose you dropped a critical table at 5:15PM on Tuesday evening, but didn't realize your mistake until Wednesday noon. Unfazed, you get out your backup, restore to the point-in-time 5:14PM Tuesday evening, and are up and running. In this history of the database universe, you never dropped the table. But suppose you later realize this wasn't such a great idea, and would like to return to sometime Wednesday morning in the original history. You won't be able to if, while your database was up-and-running, it overwrote some of the WAL segment files that led up to the time you now wish you could get back to. Thus, to avoid this, you need to distinguish the series of WAL records generated after you've done a point-in-time recovery from those that were generated in the original database history.
To deal with this problem, PostgreSQL has a notion of timelines. Whenever an archive recovery completes, a new timeline is created to identify the series of WAL records generated after that recovery. The timeline ID number is part of WAL segment file names so a new timeline does not overwrite the WAL data generated by previous timelines. It is in fact possible to archive many different timelines. While that might seem like a useless feature, it's often a lifesaver. Consider the situation where you aren't quite sure what point-in-time to recover to, and so have to do several point-in-time recoveries by trial and error until you find the best place to branch off from the old history. Without timelines this process would soon generate an unmanageable mess. With timelines, you can recover to any prior state, including states in timeline branches that you abandoned earlier.
Every time a new timeline is created, PostgreSQL creates a “timeline history” file that shows which timeline it branched off from and when. These history files are necessary to allow the system to pick the right WAL segment files when recovering from an archive that contains multiple timelines. Therefore, they are archived into the WAL archive area just like WAL segment files. The history files are just small text files, so it's cheap and appropriate to keep them around indefinitely (unlike the segment files which are large). You can, if you like, add comments to a history file to record your own notes about how and why this particular timeline was created. Such comments will be especially valuable when you have a thicket of different timelines as a result of experimentation.
The default behavior of recovery is to recover to the latest timeline found
in the archive. If you wish to recover to the timeline that was current
when the base backup was taken or into a specific child timeline (that
is, you want to return to some state that was itself generated after a
recovery attempt), you need to specify current
or the
target timeline ID in recovery_target_timeline. You
cannot recover into timelines that branched off earlier than the base backup.
Some tips for configuring continuous archiving are given here.
It is possible to use PostgreSQL's backup facilities to produce standalone hot backups. These are backups that cannot be used for point-in-time recovery, yet are typically much faster to back up and restore than pg_dump dumps. (They are also much larger than pg_dump dumps, so in some cases the speed advantage might be negated.)
As with base backups, the easiest way to produce a standalone
hot backup is to use the pg_basebackup
tool. If you include the -X
parameter when calling
it, all the write-ahead log required to use the backup will be
included in the backup automatically, and no special action is
required to restore the backup.
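For example, a standalone backup that streams the required WAL into the backup could be taken with a command such as the following; the target directory is illustrative:
pg_basebackup -D /path/to/standalone-backup -X stream -Ft -z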
If more flexibility in copying the backup files is needed, a lower level process can be used for standalone hot backups as well. To prepare for low level standalone hot backups, make sure wal_level is set to replica or higher, archive_mode to on, and set up an archive_command that performs archiving only when a switch file exists. For example:
archive_command = 'test ! -f /var/lib/pgsql/backup_in_progress || (test ! -f /var/lib/pgsql/archive/%f && cp %p /var/lib/pgsql/archive/%f)'
This command will perform archiving when
/var/lib/pgsql/backup_in_progress
exists, and otherwise
silently return zero exit status (allowing PostgreSQL
to recycle the unwanted WAL file).
With this preparation, a backup can be taken using a script like the following:
touch /var/lib/pgsql/backup_in_progress
psql -c "select pg_start_backup('hot_backup');"
tar -cf /var/lib/pgsql/backup.tar /var/lib/pgsql/data/
psql -c "select pg_stop_backup();"
rm /var/lib/pgsql/backup_in_progress
tar -rf /var/lib/pgsql/backup.tar /var/lib/pgsql/archive/
The switch file /var/lib/pgsql/backup_in_progress
is
created first, enabling archiving of completed WAL files to occur.
After the backup the switch file is removed. Archived WAL files are
then added to the backup so that both base backup and all required
WAL files are part of the same tar file.
Please remember to add error handling to your backup scripts.
If archive storage size is a concern, you can use gzip to compress the archive files:
archive_command = 'gzip < %p > /mnt/server/archivedir/%f.gz'
You will then need to use gunzip during recovery:
restore_command = 'gunzip < /mnt/server/archivedir/%f.gz > %p'
archive_command Scripts
Many people choose to use scripts to define their archive_command, so that their postgresql.conf entry looks very simple:
archive_command = 'local_backup_script.sh "%p" "%f"'
Using a separate script file is advisable any time you want to use more than a single command in the archiving process. This allows all complexity to be managed within the script, which can be written in a popular scripting language such as bash or perl.
Examples of requirements that might be solved within a script include:
Copying data to secure off-site data storage
Batching WAL files so that they are transferred every three hours, rather than one at a time
Interfacing with other backup and recovery software
Interfacing with monitoring software to report errors
When using an archive_command script, it's desirable to enable logging_collector. Any messages written to stderr from the script will then appear in the database server log, allowing complex configurations to be diagnosed easily if they fail.
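As a minimal illustration (not a production-ready script), local_backup_script.sh might simply refuse to overwrite an already-archived file and report problems through its exit status; the archive directory shown is hypothetical:
#!/bin/sh
# Called by PostgreSQL as: local_backup_script.sh "%p" "%f"
# $1 = path of the WAL file to archive, $2 = its file name
archivedir=/mnt/server/archivedir

# Fail (nonzero exit) if the file has already been archived.
test ! -f "$archivedir/$2" || exit 1

# cp returns nonzero on failure, which PostgreSQL treats as an archiving error.
cp "$1" "$archivedir/$2"
A real script would add the off-site copying, batching, or monitoring hooks described above, always being careful to return zero only when the file has been durably archived.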
At this writing, there are several limitations of the continuous archiving technique. These will probably be fixed in future releases:
If a CREATE DATABASE
command is executed while a base backup is being taken, and then
the template database that the CREATE DATABASE
copied
is modified while the base backup is still in progress, it is
possible that recovery will cause those modifications to be
propagated into the created database as well. This is of course
undesirable. To avoid this risk, it is best not to modify any
template databases while taking a base backup.
CREATE TABLESPACE
commands are WAL-logged with the literal absolute path, and will
therefore be replayed as tablespace creations with the same
absolute path. This might be undesirable if the log is being
replayed on a different machine. It can be dangerous even if the
log is being replayed on the same machine, but into a new data
directory: the replay will still overwrite the contents of the
original tablespace. To avoid potential gotchas of this sort,
the best practice is to take a new base backup after creating or
dropping tablespaces.
It should also be noted that the default WAL
format is fairly bulky since it includes many disk page snapshots.
These page snapshots are designed to support crash recovery, since
we might need to fix partially-written disk pages. Depending on
your system hardware and software, the risk of partial writes might
be small enough to ignore, in which case you can significantly
reduce the total volume of archived logs by turning off page
snapshots using the full_page_writes
parameter. (Read the notes and warnings in Chapter 30
before you do so.) Turning off page snapshots does not prevent
use of the logs for PITR operations. An area for future
development is to compress archived WAL data by removing
unnecessary page copies even when full_page_writes
is
on. In the meantime, administrators might wish to reduce the number
of page snapshots included in WAL by increasing the checkpoint
interval parameters as much as feasible.
Database servers can work together to allow a second server to take over quickly if the primary server fails (high availability), or to allow several computers to serve the same data (load balancing). Ideally, database servers could work together seamlessly. Web servers serving static web pages can be combined quite easily by merely load-balancing web requests to multiple machines. In fact, read-only database servers can be combined relatively easily too. Unfortunately, most database servers have a read/write mix of requests, and read/write servers are much harder to combine. This is because though read-only data needs to be placed on each server only once, a write to any server has to be propagated to all servers so that future read requests to those servers return consistent results.
This synchronization problem is the fundamental difficulty for servers working together. Because there is no single solution that eliminates the impact of the sync problem for all use cases, there are multiple solutions. Each solution addresses this problem in a different way, and minimizes its impact for a specific workload.
Some solutions deal with synchronization by allowing only one server to modify the data. Servers that can modify data are called read/write, master or primary servers. Servers that track changes in the primary are called standby or secondary servers. A standby server that cannot be connected to until it is promoted to a primary server is called a warm standby server, and one that can accept connections and serves read-only queries is called a hot standby server.
Some solutions are synchronous, meaning that a data-modifying transaction is not considered committed until all servers have committed the transaction. This guarantees that a failover will not lose any data and that all load-balanced servers will return consistent results no matter which server is queried. In contrast, asynchronous solutions allow some delay between the time of a commit and its propagation to the other servers, opening the possibility that some transactions might be lost in the switch to a backup server, and that load balanced servers might return slightly stale results. Asynchronous communication is used when synchronous would be too slow.
Solutions can also be categorized by their granularity. Some solutions can deal only with an entire database server, while others allow control at the per-table or per-database level.
Performance must be considered in any choice. There is usually a trade-off between functionality and performance. For example, a fully synchronous solution over a slow network might cut performance by more than half, while an asynchronous one might have a minimal performance impact.
The remainder of this section outlines various failover, replication, and load balancing solutions.
Shared disk failover avoids synchronization overhead by having only one copy of the database. It uses a single disk array that is shared by multiple servers. If the main database server fails, the standby server is able to mount and start the database as though it were recovering from a database crash. This allows rapid failover with no data loss.
Shared hardware functionality is common in network storage devices. Using a network file system is also possible, though care must be taken that the file system has full POSIX behavior (see Section 19.2.2.1). One significant limitation of this method is that if the shared disk array fails or becomes corrupt, the primary and standby servers are both nonfunctional. Another issue is that the standby server should never access the shared storage while the primary server is running.
A modified version of shared hardware functionality is file system replication, where all changes to a file system are mirrored to a file system residing on another computer. The only restriction is that the mirroring must be done in a way that ensures the standby server has a consistent copy of the file system — specifically, writes to the standby must be done in the same order as those on the primary. DRBD is a popular file system replication solution for Linux.
Warm and hot standby servers can be kept current by reading a stream of write-ahead log (WAL) records. If the main server fails, the standby contains almost all of the data of the main server, and can be quickly made the new primary database server. This can be synchronous or asynchronous and can only be done for the entire database server.
A standby server can be implemented using file-based log shipping (Section 27.2) or streaming replication (see Section 27.2.5), or a combination of both. For information on hot standby, see Section 27.4.
Logical replication allows a database server to send a stream of data modifications to another server. PostgreSQL logical replication constructs a stream of logical data modifications from the WAL. Logical replication allows replication of data changes on a per-table basis. In addition, a server that is publishing its own changes can also subscribe to changes from another server, allowing data to flow in multiple directions. For more information on logical replication, see Chapter 31. Through the logical decoding interface (Chapter 49), third-party extensions can also provide similar functionality.
A trigger-based replication setup typically funnels data modification queries to a designated primary server. Operating on a per-table basis, the primary server sends data changes (typically) asynchronously to the standby servers. Standby servers can answer queries while the primary is running, and may allow some local data changes or write activity. This form of replication is often used for offloading large analytical or data warehouse queries.
Slony-I is an example of this type of replication, with per-table granularity, and support for multiple standby servers. Because it updates the standby server asynchronously (in batches), there is possible data loss during fail over.
With SQL-based replication middleware, a program intercepts every SQL query and sends it to one or all servers. Each server operates independently. Read-write queries must be sent to all servers, so that every server receives any changes. But read-only queries can be sent to just one server, allowing the read workload to be distributed among them.
If queries are simply broadcast unmodified, functions like random(), CURRENT_TIMESTAMP, and sequences can have different values on different servers. This is because each server operates independently, and because SQL queries are broadcast rather than actual data changes. If this is unacceptable, either the middleware or the application must determine such values from a single source and then use those values in write queries. Care must also be taken that all transactions either commit or abort on all servers, perhaps using two-phase commit (PREPARE TRANSACTION and COMMIT PREPARED). Pgpool-II and Continuent Tungsten are examples of this type of replication.
For servers that are not regularly connected or have slow communication links, like laptops or remote servers, keeping data consistent among servers is a challenge. Using asynchronous multimaster replication, each server works independently, and periodically communicates with the other servers to identify conflicting transactions. The conflicts can be resolved by users or conflict resolution rules. Bucardo is an example of this type of replication.
In synchronous multimaster replication, each server can accept
write requests, and modified data is transmitted from the
original server to every other server before each transaction
commits. Heavy write activity can cause excessive locking and
commit delays, leading to poor performance. Read requests can
be sent to any server. Some implementations use shared disk
to reduce the communication overhead. Synchronous multimaster
replication is best for mostly read workloads, though its big
advantage is that any server can accept write requests —
there is no need to partition workloads between primary and
standby servers, and because the data changes are sent from one
server to another, there is no problem with non-deterministic
functions like random()
.
PostgreSQL does not offer this type of replication, though PostgreSQL two-phase commit (PREPARE TRANSACTION and COMMIT PREPARED) can be used to implement this in application code or middleware.
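For illustration only, a sketch of how application code or middleware might coordinate such a write with two-phase commit follows; the table and transaction identifier are hypothetical, and max_prepared_transactions must be set to a nonzero value for this to work:
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 42;
PREPARE TRANSACTION 'xfer_42';
-- Run the same statements and PREPARE TRANSACTION on every other server.
-- Only if all servers prepared successfully, finish everywhere with:
COMMIT PREPARED 'xfer_42';
-- If any server failed to prepare, roll back everywhere instead:
-- ROLLBACK PREPARED 'xfer_42';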
Table 27.1 summarizes the capabilities of the various solutions listed above.
Table 27.1. High Availability, Load Balancing, and Replication Feature Matrix
Feature | Shared Disk | File System Repl. | Write-Ahead Log Shipping | Logical Repl. | Trigger-Based Repl. | SQL Repl. Middle-ware | Async. MM Repl. | Sync. MM Repl. |
---|---|---|---|---|---|---|---|---|
Popular examples | NAS | DRBD | built-in streaming repl. | built-in logical repl., pglogical | Londiste, Slony | pgpool-II | Bucardo | |
Comm. method | shared disk | disk blocks | WAL | logical decoding | table rows | SQL | table rows | table rows and row locks |
No special hardware required | • | • | • | • | • | • | • | |
Allows multiple primary servers | • | • | • | • | ||||
No overhead on primary | • | • | • | • | ||||
No waiting for multiple servers | • | with sync off | with sync off | • | • | |||
Primary failure will never lose data | • | • | with sync on | with sync on | • | • | ||
Replicas accept read-only queries | with hot standby | • | • | • | • | • | ||
Per-table granularity | • | • | • | • | ||||
No conflict resolution necessary | • | • | • | • | • | • |
There are a few solutions that do not fit into the above categories:
Data partitioning splits tables into data sets. Each set can be modified by only one server. For example, data can be partitioned by offices, e.g., London and Paris, with a server in each office. If queries combining London and Paris data are necessary, an application can query both servers, or primary/standby replication can be used to keep a read-only copy of the other office's data on each server.
Many of the above solutions allow multiple servers to handle multiple queries, but none allow a single query to use multiple servers to complete faster. This solution allows multiple servers to work concurrently on a single query. It is usually accomplished by splitting the data among servers and having each server execute its part of the query and return results to a central server where they are combined and returned to the user. This can be implemented using the PL/Proxy tool set.
It should also be noted that because PostgreSQL is open source and easily extended, a number of companies have taken PostgreSQL and created commercial closed-source solutions with unique failover, replication, and load balancing capabilities. These are not discussed here.
Continuous archiving can be used to create a high availability (HA) cluster configuration with one or more standby servers ready to take over operations if the primary server fails. This capability is widely referred to as warm standby or log shipping.
The primary and standby server work together to provide this capability, though the servers are only loosely coupled. The primary server operates in continuous archiving mode, while each standby server operates in continuous recovery mode, reading the WAL files from the primary. No changes to the database tables are required to enable this capability, so it offers low administration overhead compared to some other replication solutions. This configuration also has relatively low performance impact on the primary server.
Directly moving WAL records from one database server to another is typically described as log shipping. PostgreSQL implements file-based log shipping by transferring WAL records one file (WAL segment) at a time. WAL files (16MB) can be shipped easily and cheaply over any distance, whether it be to an adjacent system, another system at the same site, or another system on the far side of the globe. The bandwidth required for this technique varies according to the transaction rate of the primary server. Record-based log shipping is more granular and streams WAL changes incrementally over a network connection (see Section 27.2.5).
It should be noted that log shipping is asynchronous, i.e., the WAL
records are shipped after transaction commit. As a result, there is a
window for data loss should the primary server suffer a catastrophic
failure; transactions not yet shipped will be lost. The size of the
data loss window in file-based log shipping can be limited by use of the
archive_timeout
parameter, which can be set as low
as a few seconds. However such a low setting will
substantially increase the bandwidth required for file shipping.
Streaming replication (see Section 27.2.5)
allows a much smaller window of data loss.
Recovery performance is sufficiently good that the standby will typically be only moments away from full availability once it has been activated. As a result, this is called a warm standby configuration which offers high availability. Restoring a server from an archived base backup and rollforward will take considerably longer, so that technique only offers a solution for disaster recovery, not high availability. A standby server can also be used for read-only queries, in which case it is called a Hot Standby server. See Section 27.4 for more information.
It is usually wise to create the primary and standby servers so that they are as similar as possible, at least from the perspective of the database server. In particular, the path names associated with tablespaces will be passed across unmodified, so both primary and standby servers must have the same mount paths for tablespaces if that feature is used. Keep in mind that if CREATE TABLESPACE is executed on the primary, any new mount point needed for it must be created on the primary and all standby servers before the command is executed. Hardware need not be exactly the same, but experience shows that maintaining two identical systems is easier than maintaining two dissimilar ones over the lifetime of the application and system. In any case the hardware architecture must be the same — shipping from, say, a 32-bit to a 64-bit system will not work.
In general, log shipping between servers running different major PostgreSQL release levels is not possible. It is the policy of the PostgreSQL Global Development Group not to make changes to disk formats during minor release upgrades, so it is likely that running different minor release levels on primary and standby servers will work successfully. However, no formal support for that is offered and you are advised to keep primary and standby servers at the same release level as much as possible. When updating to a new minor release, the safest policy is to update the standby servers first — a new minor release is more likely to be able to read WAL files from a previous minor release than vice versa.
A server enters standby mode if a
standby.signal
file exists in the data directory when the server is started.
In standby mode, the server continuously applies WAL received from the
primary server. The standby server can read WAL from a WAL archive
(see restore_command) or directly from the primary
over a TCP connection (streaming replication). The standby server will
also attempt to restore any WAL found in the standby cluster's
pg_wal
directory. That typically happens after a server
restart, when the standby replays again WAL that was streamed from the
primary before the restart, but you can also manually copy files to
pg_wal
at any time to have them replayed.
At startup, the standby begins by restoring all WAL available in the
archive location, calling restore_command
. Once it
reaches the end of WAL available there and restore_command
fails, it tries to restore any WAL available in the pg_wal
directory.
If that fails, and streaming replication has been configured, the
standby tries to connect to the primary server and start streaming WAL
from the last valid record found in archive or pg_wal
. If that fails
or streaming replication is not configured, or if the connection is
later disconnected, the standby goes back to step 1 and tries to
restore the file from the archive again. This loop of retries from the
archive, pg_wal
, and via streaming replication goes on until the server
is stopped or is promoted.
Standby mode is exited and the server switches to normal operation
when pg_ctl promote
is run,
pg_promote()
is called, or a trigger file is found
(promote_trigger_file
). Before failover,
any WAL immediately available in the archive or in pg_wal
will be
restored, but no attempt is made to connect to the primary.
Set up continuous archiving on the primary to an archive directory accessible from the standby, as described in Section 26.3. The archive location should be accessible from the standby even when the primary is down, i.e., it should reside on the standby server itself or another trusted server, not on the primary server.
If you want to use streaming replication, set up authentication on the primary server to allow replication connections from the standby server(s); that is, create a role and provide a suitable entry or entries in pg_hba.conf with the database field set to replication. Also ensure max_wal_senders is set to a sufficiently large value in the configuration file of the primary server. If replication slots will be used, ensure that max_replication_slots is set sufficiently high as well.
Take a base backup as described in Section 26.3.2 to bootstrap the standby server.
To set up the standby server, restore the base backup taken from the primary server (see Section 26.3.4). Create a file standby.signal in the standby's cluster data directory. Set restore_command to a simple command to copy files from the WAL archive. If you plan to have multiple standby servers for high availability purposes, make sure that recovery_target_timeline is set to latest (the default), to make the standby server follow the timeline change that occurs at failover to another standby.
restore_command should return immediately if the file does not exist; the server will retry the command again if necessary.
If you want to use streaming replication, fill in primary_conninfo with a libpq connection string, including the host name (or IP address) and any additional details needed to connect to the primary server. If the primary needs a password for authentication, the password needs to be specified in primary_conninfo as well.
If you're setting up the standby server for high availability purposes, set up WAL archiving, connections and authentication like the primary server, because the standby server will work as a primary server after failover.
If you're using a WAL archive, its size can be minimized using the archive_cleanup_command parameter to remove files that are no longer required by the standby server. The pg_archivecleanup utility is designed specifically to be used with archive_cleanup_command in typical single-standby configurations; see pg_archivecleanup. Note, however, that if you're using the archive for backup purposes, you need to retain files needed to recover from at least the latest base backup, even if they're no longer needed by the standby.
A simple example of configuration is:
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000'''
restore_command = 'cp /path/to/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
You can have any number of standby servers, but if you use streaming
replication, make sure you set max_wal_senders
high enough in
the primary to allow them to be connected simultaneously.
Streaming replication allows a standby server to stay more up-to-date than is possible with file-based log shipping. The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled.
Streaming replication is asynchronous by default
(see Section 27.2.8), in which case there is
a small delay between committing a transaction in the primary and the
changes becoming visible in the standby. This delay is however much
smaller than with file-based log shipping, typically under one second
assuming the standby is powerful enough to keep up with the load. With
streaming replication, archive_timeout
is not required to
reduce the data loss window.
If you use streaming replication without file-based continuous
archiving, the server might recycle old WAL segments before the standby
has received them. If this occurs, the standby will need to be
reinitialized from a new base backup. You can avoid this by setting
wal_keep_size
to a value large enough to ensure that
WAL segments are not recycled too early, or by configuring a replication
slot for the standby. If you set up a WAL archive that's accessible from
the standby, these solutions are not required, since the standby can
always use the archive to catch up provided it retains enough segments.
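For example, to keep roughly the last gigabyte of WAL available for standbys, you might set the following on the primary; the value is only an illustration and should be sized for your workload:
wal_keep_size = '1GB'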
To use streaming replication, set up a file-based log-shipping standby server as described in Section 27.2. The step that turns a file-based log-shipping standby into a streaming replication standby is setting primary_conninfo to point to the primary server. Set listen_addresses and authentication options (see pg_hba.conf) on the primary so that the standby server can connect to the replication pseudo-database on the primary server (see Section 27.2.5.1).
On systems that support the keepalive socket option, setting tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count helps the primary promptly notice a broken connection.
Set the maximum number of concurrent connections from the standby servers (see max_wal_senders for details).
When the standby is started and primary_conninfo
is set
correctly, the standby will connect to the primary after replaying all
WAL files available in the archive. If the connection is established
successfully, you will see a walreceiver
in the standby, and
a corresponding walsender
process in the primary.
It is very important that the access privileges for replication be set up so that only trusted users can read the WAL stream, because it is easy to extract privileged information from it. Standby servers must authenticate to the primary as an account that has the REPLICATION privilege or a superuser. It is recommended to create a dedicated user account with REPLICATION and LOGIN privileges for replication. While REPLICATION privilege gives very high permissions, it does not allow the user to modify any data on the primary system, which the SUPERUSER privilege does.
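For example, a dedicated replication account matching the examples below could be created on the primary with:
CREATE ROLE foo WITH REPLICATION LOGIN PASSWORD 'foopass';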
Client authentication for replication is controlled by a pg_hba.conf record specifying replication in the database field. For example, if the standby is running on host IP 192.168.1.100 and the account name for replication is foo, the administrator can add the following line to the pg_hba.conf file on the primary:
# Allow the user "foo" from host 192.168.1.100 to connect to the primary
# as a replication standby if the user's password is correctly supplied.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     foo             192.168.1.100/32        md5
The host name and port number of the primary, connection user name, and password are specified in the primary_conninfo. The password can also be set in the ~/.pgpass file on the standby (specify replication in the database field). For example, if the primary is running on host IP 192.168.1.50, port 5432, the account name for replication is foo, and the password is foopass, the administrator can add the following line to the postgresql.conf file on the standby:
# The standby connects to the primary that is running on host 192.168.1.50
# and port 5432 as the user "foo" whose password is "foopass".
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
An important health indicator of streaming replication is the amount
of WAL records generated in the primary, but not yet applied in the
standby. You can calculate this lag by comparing the current WAL write
location on the primary with the last WAL location received by the
standby. These locations can be retrieved using
pg_current_wal_lsn
on the primary and
pg_last_wal_receive_lsn
on the standby,
respectively (see Table 9.87 and
Table 9.88 for details).
The last WAL receive location in the standby is also displayed in the
process status of the WAL receiver process, displayed using the
ps
command (see Section 28.1 for details).
You can retrieve a list of WAL sender processes via the pg_stat_replication view. Large differences between pg_current_wal_lsn and the view's sent_lsn field might indicate that the primary server is under heavy load, while differences between sent_lsn and pg_last_wal_receive_lsn on the standby might indicate network delay, or that the standby is under heavy load.
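For example, the following query on the primary computes, for each connected standby, the approximate number of bytes of WAL that have been generated but not yet sent; pg_wal_lsn_diff is the standard helper for subtracting two WAL locations:
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes
FROM pg_stat_replication;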
On a hot standby, the status of the WAL receiver process can be retrieved
via the
pg_stat_wal_receiver
view. A large
difference between pg_last_wal_replay_lsn
and the
view's flushed_lsn
indicates that WAL is being
received faster than it can be replayed.
Replication slots provide an automated way to ensure that the primary does not remove WAL segments until they have been received by all standbys, and that the primary does not remove rows which could cause a recovery conflict even when the standby is disconnected.
In lieu of using replication slots, it is possible to prevent the removal
of old WAL segments using wal_keep_size, or by
storing the segments in an archive using
archive_command.
However, these methods often result in retaining more WAL segments than
required, whereas replication slots retain only the number of segments
known to be needed. On the other hand, replication slots can retain so
many WAL segments that they fill up the space allocated
for pg_wal
;
max_slot_wal_keep_size limits the size of WAL files
retained by replication slots.
Similarly, hot_standby_feedback and vacuum_defer_cleanup_age provide protection against relevant rows being removed by vacuum, but the former provides no protection during any time period when the standby is not connected, and the latter often needs to be set to a high value to provide adequate protection. Replication slots overcome these disadvantages.
Each replication slot has a name, which can contain lower-case letters, numbers, and the underscore character.
Existing replication slots and their state can be seen in the
pg_replication_slots
view.
Slots can be created and dropped either via the streaming replication protocol (see Section 53.4) or via SQL functions (see Section 9.27.6).
You can create a replication slot like this:
postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot');
  slot_name  | lsn
-------------+-----
 node_a_slot |

postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots;
  slot_name  | slot_type | active
-------------+-----------+--------
 node_a_slot | physical  | f
(1 row)
To configure the standby to use this slot, primary_slot_name
should be configured on the standby. Here is a simple example:
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' primary_slot_name = 'node_a_slot'
The cascading replication feature allows a standby server to accept replication connections and stream WAL records to other standbys, acting as a relay. This can be used to reduce the number of direct connections to the primary and also to minimize inter-site bandwidth overheads.
A standby acting as both a receiver and a sender is known as a cascading standby. Standbys that are more directly connected to the primary are known as upstream servers, while those standby servers further away are downstream servers. Cascading replication does not place limits on the number or arrangement of downstream servers, though each standby connects to only one upstream server which eventually links to a single primary server.
A cascading standby sends not only WAL records received from the primary but also those restored from the archive. So even if the replication connection in some upstream connection is terminated, streaming replication continues downstream for as long as new WAL records are available.
Cascading replication is currently asynchronous. Synchronous replication (see Section 27.2.8) settings have no effect on cascading replication at present.
Hot Standby feedback propagates upstream, whatever the cascaded arrangement.
If an upstream standby server is promoted to become the new primary, downstream
servers will continue to stream from the new primary if
recovery_target_timeline
is set to 'latest'
(the default).
To use cascading replication, set up the cascading standby so that it can accept replication connections (that is, set max_wal_senders and hot_standby, and configure host-based authentication). You will also need to set primary_conninfo in the downstream standby to point to the cascading standby.
PostgreSQL streaming replication is asynchronous by default. If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover.
Synchronous replication offers the ability to confirm that all changes made by a transaction have been transferred to one or more synchronous standby servers. This extends the standard level of durability offered by a transaction commit. This level of protection is referred to as 2-safe replication in computer science theory, and group-1-safe (group-safe and 1-safe) when synchronous_commit is set to remote_write.
When requesting synchronous replication, each commit of a write transaction will wait until confirmation is received that the commit has been written to the write-ahead log on disk of both the primary and standby server. The only possibility that data can be lost is if both the primary and the standby suffer crashes at the same time. This can provide a much higher level of durability, though only if the sysadmin is cautious about the placement and management of the two servers. Waiting for confirmation increases the user's confidence that the changes will not be lost in the event of server crashes but it also necessarily increases the response time for the requesting transaction. The minimum wait time is the round-trip time between primary and standby.
Read-only transactions and transaction rollbacks need not wait for replies from standby servers. Subtransaction commits do not wait for responses from standby servers, only top-level commits. Long running actions such as data loading or index building do not wait until the very final commit message. All two-phase commit actions require commit waits, including both prepare and commit.
A synchronous standby can be a physical replication standby or a logical replication subscriber. It can also be any other physical or logical WAL replication stream consumer that knows how to send the appropriate feedback messages. Besides the built-in physical and logical replication systems, this includes special programs such as pg_receivewal and pg_recvlogical as well as some third-party replication systems and custom programs. Check the respective documentation for details on synchronous replication support.
Once streaming replication has been configured, configuring synchronous replication requires only one additional configuration step: synchronous_standby_names must be set to a non-empty value. synchronous_commit must also be set to on, but since this is the default value, typically no change is required. (See Section 20.5.1 and Section 20.6.2.)
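As a minimal sketch, assuming a single standby that connects with application_name set to s1 (the name is hypothetical), the primary's postgresql.conf might contain:

# Wait for the standby named "s1" to confirm each commit.
synchronous_standby_names = 's1'
synchronous_commit = on        # the default; shown only for clarity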
This configuration will cause each commit to wait for confirmation that the standby has written the commit record to durable storage. synchronous_commit can be set by individual users, so it can be configured in the configuration file, for particular users or databases, or dynamically by applications, in order to control the durability guarantee on a per-transaction basis.
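For example, an application might relax the guarantee for a single low-importance transaction; this sketch uses SET LOCAL so the setting reverts automatically at the end of the transaction:

BEGIN;
SET LOCAL synchronous_commit TO off;
-- ... changes the application can afford to lose ...
COMMIT;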
After a commit record has been written to disk on the primary, the WAL record is then sent to the standby. The standby sends reply messages each time a new batch of WAL data is written to disk, unless wal_receiver_status_interval is set to zero on the standby. In the case that synchronous_commit is set to remote_apply, the standby sends reply messages when the commit record is replayed, making the transaction visible.
If the standby is chosen as a synchronous standby, according to the setting of synchronous_standby_names on the primary, the reply messages from that standby will be considered along with those from other synchronous standbys to decide when to release transactions waiting for confirmation that the commit record has been received. These parameters allow the administrator to specify which standby servers should be synchronous standbys. Note that the configuration of synchronous replication is mainly on the primary. Named standbys must be directly connected to the primary; the primary knows nothing about downstream standby servers using cascaded replication.
Setting synchronous_commit to remote_write will cause each commit to wait for confirmation that the standby has received the commit record and written it out to its own operating system, but not for the data to be flushed to disk on the standby. This setting provides a weaker guarantee of durability than on does: the standby could lose the data in the event of an operating system crash, though not a PostgreSQL crash. However, it's a useful setting in practice because it can decrease the response time for the transaction. Data loss could only occur if both the primary and the standby crash and the database of the primary gets corrupted at the same time.
Setting synchronous_commit to remote_apply will cause each commit to wait until the current synchronous standbys report that they have replayed the transaction, making it visible to user queries. In simple cases, this allows for load balancing with causal consistency.
Users will stop waiting if a fast shutdown is requested. However, as when using asynchronous replication, the server will not fully shut down until all outstanding WAL records are transferred to the currently connected standby servers.
Synchronous replication supports one or more synchronous standby servers; transactions will wait until all the standby servers which are considered as synchronous confirm receipt of their data. The number of synchronous standbys that transactions must wait for replies from is specified in synchronous_standby_names. This parameter also specifies a list of standby names and the method (FIRST and ANY) to choose synchronous standbys from the listed ones.
The method FIRST specifies a priority-based synchronous replication and makes transaction commits wait until their WAL records are replicated to the requested number of synchronous standbys chosen based on their priorities. The standbys whose names appear earlier in the list are given higher priority and will be considered as synchronous. Other standby servers appearing later in this list represent potential synchronous standbys. If any of the current synchronous standbys disconnects for whatever reason, it will be replaced immediately with the next-highest-priority standby.
An example of synchronous_standby_names for priority-based multiple synchronous standbys is:

synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
In this example, if four standby servers s1, s2, s3 and s4 are running, the two standbys s1 and s2 will be chosen as synchronous standbys because their names appear early in the list of standby names. s3 is a potential synchronous standby and will take over the role of synchronous standby when either of s1 or s2 fails. s4 is an asynchronous standby since its name is not in the list.
The method ANY specifies a quorum-based synchronous replication and makes transaction commits wait until their WAL records are replicated to at least the requested number of synchronous standbys in the list.
An example of synchronous_standby_names for quorum-based multiple synchronous standbys is:

synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
In this example, if four standby servers s1, s2, s3 and s4 are running, transaction commits will wait for replies from at least any two standbys of s1, s2 and s3. s4 is an asynchronous standby since its name is not in the list.
The synchronous states of standby servers can be viewed using the pg_stat_replication view.
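For example, the following sketch shows each standby's replication state and synchronous status as reported on the primary:

SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;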
Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilize system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will reduce performance for database applications because of increased response times and higher contention.
PostgreSQL allows the application developer to specify the durability level required via replication. This can be specified for the system overall, though it can also be specified for specific users or connections, or even individual transactions.
For example, an application workload might consist of: 10% of changes are important customer details, while 90% of changes are less important data that the business can more easily survive if it is lost, such as chat messages between users.
With synchronous replication options specified at the application level (on the primary) we can offer synchronous replication for the most important changes, without slowing down the bulk of the total workload. Application level options are an important and practical tool for allowing the benefits of synchronous replication for high performance applications.
You should consider that the network bandwidth must be higher than the rate of generation of WAL data.
synchronous_standby_names specifies the number and names of synchronous standbys that transaction commits made when synchronous_commit is set to on, remote_apply or remote_write will wait for responses from. Such transaction commits may never be completed if any one of the synchronous standbys should crash.
The best solution for high availability is to ensure you keep as many synchronous standbys as requested. This can be achieved by naming multiple potential synchronous standbys using synchronous_standby_names.
In a priority-based synchronous replication, the standbys whose names appear earlier in the list will be used as synchronous standbys. Standbys listed after these will take over the role of synchronous standby if one of current ones should fail.
In a quorum-based synchronous replication, all the standbys appearing in the list will be used as candidates for synchronous standbys. Even if one of them should fail, the other standbys will keep performing the role of candidates of synchronous standby.
When a standby first attaches to the primary, it will not yet be properly synchronized. This is described as catchup mode. Once the lag between standby and primary reaches zero for the first time we move to real-time streaming state. The catch-up duration may be long immediately after the standby has been created. If the standby is shut down, then the catch-up period will increase according to the length of time the standby has been down. The standby is only able to become a synchronous standby once it has reached streaming state. This state can be viewed using the pg_stat_replication view.
If the primary restarts while commits are waiting for acknowledgment, those waiting transactions will be marked fully committed once the primary database recovers. There is no way to be certain that all standbys have received all outstanding WAL data at the time of the crash of the primary. Some transactions may not show as committed on the standby, even though they show as committed on the primary. The guarantee we offer is that the application will not receive explicit acknowledgment of the successful commit of a transaction until the WAL data is known to be safely received by all the synchronous standbys.
If you really cannot keep as many synchronous standbys as requested then you should decrease the number of synchronous standbys that transaction commits must wait for responses from in synchronous_standby_names (or disable it) and reload the configuration file on the primary server.
If the primary is isolated from remaining standby servers you should fail over to the best candidate of those other remaining standby servers.
If you need to re-create a standby server while transactions are waiting, make sure that the commands pg_start_backup() and pg_stop_backup() are run in a session with synchronous_commit = off, otherwise those requests will wait forever for the standby to appear.
When continuous WAL archiving is used in a standby, there are two different scenarios: the WAL archive can be shared between the primary and the standby, or the standby can have its own WAL archive. When the standby has its own WAL archive, set archive_mode to always, and the standby will call the archive command for every WAL segment it receives, whether it's by restoring from the archive or by streaming replication. The shared archive can be handled similarly, but the archive_command must test if the file being archived exists already, and if the existing file has identical contents. This requires more care in the archive_command, as it must be careful to not overwrite an existing file with different contents, but return success if the exact same file is archived twice. And all that must be done free of race conditions, if two servers attempt to archive the same file at the same time.
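As a sketch of the standby-archive case, assuming /mnt/standby_archive is the (hypothetical) archive directory, the standby's postgresql.conf could contain:

archive_mode = always
archive_command = 'test ! -f /mnt/standby_archive/%f && cp %p /mnt/standby_archive/%f'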
If archive_mode is set to on, the archiver is not enabled during recovery or standby mode. If the standby server is promoted, it will start archiving after the promotion, but will not archive any WAL or timeline history files that it did not generate itself. To get a complete series of WAL files in the archive, you must ensure that all WAL is archived, before it reaches the standby. This is inherently true with file-based log shipping, as the standby can only restore files that are found in the archive, but not if streaming replication is enabled. When a server is not in recovery mode, there is no difference between on and always modes.
If the primary server fails then the standby server should begin failover procedures.
If the standby server fails then no failover need take place. If the standby server can be restarted, even some time later, then the recovery process can also be restarted immediately, taking advantage of restartable recovery. If the standby server cannot be restarted, then a full new standby server instance should be created.
If the primary server fails and the standby server becomes the new primary, and then the old primary restarts, you must have a mechanism for informing the old primary that it is no longer the primary. This is sometimes known as STONITH (Shoot The Other Node In The Head), which is necessary to avoid situations where both systems think they are the primary, which will lead to confusion and ultimately data loss.
Many failover systems use just two systems, the primary and the standby, connected by some kind of heartbeat mechanism to continually verify the connectivity between the two and the viability of the primary. It is also possible to use a third system (called a witness server) to prevent some cases of inappropriate failover, but the additional complexity might not be worthwhile unless it is set up with sufficient care and rigorous testing.
PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration.
Once failover to the standby occurs, there is only a single server in operation. This is known as a degenerate state. The former standby is now the primary, but the former primary is down and might stay down. To return to normal operation, a standby server must be recreated, either on the former primary system when it comes up, or on a third, possibly new, system. The pg_rewind utility can be used to speed up this process on large clusters. Once complete, the primary and standby can be considered to have switched roles. Some people choose to use a third server to provide backup for the new primary until the new standby server is recreated, though clearly this complicates the system configuration and operational processes.
So, switching from primary to standby server can be fast but requires some time to re-prepare the failover cluster. Regular switching from primary to standby is useful, since it allows regular downtime on each system for maintenance. This also serves as a test of the failover mechanism to ensure that it will really work when you need it. Written administration procedures are advised.
To trigger failover of a log-shipping standby server, run pg_ctl promote, call pg_promote(), or create a trigger file with the file name and path specified by promote_trigger_file. If you're planning to use pg_ctl promote or to call pg_promote() to fail over, promote_trigger_file is not required. If you're setting up reporting servers that are only used to offload read-only queries from the primary, not for high availability purposes, you don't need to promote them.
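For example, promotion can also be requested from a SQL session on the standby; this sketch waits up to 60 seconds for the promotion to complete:

SELECT pg_promote(true, 60);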
Hot Standby is the term used to describe the ability to connect to the server and run read-only queries while the server is in archive recovery or standby mode. This is useful both for replication purposes and for restoring a backup to a desired state with great precision. The term Hot Standby also refers to the ability of the server to move from recovery through to normal operation while users continue running queries and/or keep their connections open.
Running queries in hot standby mode is similar to normal query operation, though there are several usage and administrative differences explained below.
When the hot_standby parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. All such connections are strictly read-only; not even temporary tables may be written.
The data on the standby takes some time to arrive from the primary server so there will be a measurable delay between primary and standby. Running the same query nearly simultaneously on both primary and standby might therefore return differing results. We say that data on the standby is eventually consistent with the primary. Once the commit record for a transaction is replayed on the standby, the changes made by that transaction will be visible to any new snapshots taken on the standby. Snapshots may be taken at the start of each query or at the start of each transaction, depending on the current transaction isolation level. For more details, see Section 13.2.
Transactions started during hot standby may issue the following commands:
Query access: SELECT, COPY TO
Cursor commands: DECLARE, FETCH, CLOSE
Settings: SHOW, SET, RESET
Transaction management commands: BEGIN, END, ABORT, START TRANSACTION; SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT; EXCEPTION blocks and other internal subtransactions
LOCK TABLE, though only when explicitly in one of these modes: ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE.
Plans and resources: PREPARE, EXECUTE, DEALLOCATE, DISCARD
Plugins and extensions: LOAD
UNLISTEN
Transactions started during hot standby will never be assigned a transaction ID and cannot write to the system write-ahead log. Therefore, the following actions will produce error messages:
Data Manipulation Language (DML): INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE. Note that there are no allowed actions that result in a trigger being executed during recovery. This restriction applies even to temporary tables, because table rows cannot be read or written without assigning a transaction ID, which is currently not possible in a Hot Standby environment.
Data Definition Language (DDL): CREATE, DROP, ALTER, COMMENT. This restriction applies even to temporary tables, because carrying out these operations would require updating the system catalog tables.
SELECT ... FOR SHARE | UPDATE, because row locks cannot be taken without updating the underlying data files.
Rules on SELECT statements that generate DML commands.
LOCK that explicitly requests a mode higher than ROW EXCLUSIVE MODE.
LOCK in short default form, since it requests ACCESS EXCLUSIVE MODE.
Transaction management commands that explicitly set non-read-only state: BEGIN READ WRITE, START TRANSACTION READ WRITE; SET TRANSACTION READ WRITE, SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE; SET transaction_read_only = off
Two-phase commit commands: PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED, because even read-only transactions need to write WAL in the prepare phase (the first phase of two phase commit).
Sequence updates: nextval(), setval()
LISTEN, NOTIFY
In normal operation, “read-only” transactions are allowed to use LISTEN and NOTIFY, so Hot Standby sessions operate under slightly tighter restrictions than ordinary read-only sessions. It is possible that some of these restrictions might be loosened in a future release.
During hot standby, the parameter transaction_read_only is always true and may not be changed. But as long as no attempt is made to modify the database, connections during hot standby will act much like any other database connection. If failover or switchover occurs, the database will switch to normal processing mode. Sessions will remain connected while the server changes mode. Once hot standby finishes, it will be possible to initiate read-write transactions (even from a session begun during hot standby).
Users can determine whether hot standby is currently active for their session by issuing SHOW in_hot_standby. (In server versions before 14, the in_hot_standby parameter did not exist; a workable substitute method for older servers is SHOW transaction_read_only.) In addition, a set of functions (Table 9.88) allow users to access information about the standby server. These allow you to write programs that are aware of the current state of the database. These can be used to monitor the progress of recovery, or to allow you to write complex programs that restore the database to particular states.
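For example, a session on the standby might check its status and replay progress like this sketch, using the recovery information functions:

SHOW in_hot_standby;

SELECT pg_is_in_recovery(),
       pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp();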
The primary and standby servers are in many ways loosely connected. Actions on the primary will have an effect on the standby. As a result, there is potential for negative interactions or conflicts between them. The easiest conflict to understand is performance: if a huge data load is taking place on the primary then this will generate a similar stream of WAL records on the standby, so standby queries may contend for system resources, such as I/O.
There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard conflicts in the sense that queries might need to be canceled and, in some cases, sessions disconnected to resolve them. The user is provided with several ways to handle these conflicts. Conflict cases include:
Access Exclusive locks taken on the primary server, including both explicit LOCK commands and various DDL actions, conflict with table accesses in standby queries.
Dropping a tablespace on the primary conflicts with standby queries using that tablespace for temporary work files.
Dropping a database on the primary conflicts with sessions connected to that database on the standby.
Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snapshots can still “see” any of the rows to be removed.
Application of a vacuum cleanup record from WAL conflicts with queries accessing the target page on the standby, whether or not the data to be removed is visible.
On the primary server, these cases simply result in waiting; and the user might choose to cancel either of the conflicting actions. However, on the standby there is no choice: the WAL-logged action already occurred on the primary so the standby must not fail to apply it. Furthermore, allowing WAL application to wait indefinitely may be very undesirable, because the standby's state will become increasingly far behind the primary's. Therefore, a mechanism is provided to forcibly cancel standby queries that conflict with to-be-applied WAL records.
An example of the problem situation is an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the standby query cannot continue if the DROP TABLE is applied on the standby. If this situation occurred on the primary, the DROP TABLE would wait until the other query had finished. But when DROP TABLE is run on the primary, the primary doesn't have information about what queries are running on the standby, so it will not wait for any such standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The standby server must either delay application of the WAL records (and everything after them, too) or else cancel the conflicting query so that the DROP TABLE can be applied.
When a conflicting query is short, it's typically desirable to allow it to complete by delaying WAL application for a little bit; but a long delay in WAL application is usually not desirable. So the cancel mechanism has parameters, max_standby_archive_delay and max_standby_streaming_delay, that define the maximum allowed delay in WAL application. Conflicting queries will be canceled once it has taken longer than the relevant delay setting to apply any newly-received WAL data. There are two parameters so that different delay values can be specified for the case of reading WAL data from an archive (i.e., initial recovery from a base backup or “catching up” a standby server that has fallen far behind) versus reading WAL data via streaming replication.
In a standby server that exists primarily for high availability, it's best to set the delay parameters relatively short, so that the server cannot fall far behind the primary due to delays caused by standby queries. However, if the standby server is meant for executing long-running queries, then a high or even infinite delay value may be preferable. Keep in mind however that a long-running query could cause other sessions on the standby server to not see recent changes on the primary, if it delays application of WAL records.
Once the delay specified by max_standby_archive_delay or max_standby_streaming_delay has been exceeded, conflicting queries will be canceled. This usually results just in a cancellation error, although in the case of replaying a DROP DATABASE the entire conflicting session will be terminated. Also, if the conflict is over a lock held by an idle transaction, the conflicting session is terminated (this behavior might change in the future).
Canceled queries may be retried immediately (after beginning a new transaction, of course). Since query cancellation depends on the nature of the WAL records being replayed, a query that was canceled may well succeed if it is executed again.
Keep in mind that the delay parameters are compared to the elapsed time since the WAL data was received by the standby server. Thus, the grace period allowed to any one query on the standby is never more than the delay parameter, and could be considerably less if the standby has already fallen behind as a result of waiting for previous queries to complete, or as a result of being unable to keep up with a heavy update load.
The most common reason for conflict between standby queries and WAL replay is “early cleanup”. Normally, PostgreSQL allows cleanup of old row versions when there are no transactions that need to see them to ensure correct visibility of data according to MVCC rules. However, this rule can only be applied for transactions executing on the primary. So it is possible that cleanup on the primary will remove row versions that are still visible to a transaction on the standby.
Experienced users should note that both row version cleanup and row version freezing will potentially conflict with standby queries. Running a manual VACUUM FREEZE is likely to cause conflicts even on tables with no updated or deleted rows.
Users should be clear that tables that are regularly and heavily updated on the primary server will quickly cause cancellation of longer running queries on the standby. In such cases the setting of a finite value for max_standby_archive_delay or max_standby_streaming_delay can be considered similar to setting statement_timeout.
Remedial possibilities exist if the number of standby-query cancellations is found to be unacceptable. The first option is to set the parameter hot_standby_feedback, which prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur. If you do this, you should note that this will delay cleanup of dead rows on the primary, which may result in undesirable table bloat. However, the cleanup situation will be no worse than if the standby queries were running directly on the primary server, and you are still getting the benefit of off-loading execution onto the standby.
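A minimal sketch of this option, set in postgresql.conf on the standby:

# Report the oldest snapshot on this standby to the primary so that VACUUM
# does not remove rows still needed by standby queries.
hot_standby_feedback = on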
If standby servers connect and disconnect frequently, you might want to make adjustments to handle the period when hot_standby_feedback feedback is not being provided. For example, consider increasing max_standby_archive_delay so that queries are not rapidly canceled by conflicts in WAL archive files during disconnected periods. You should also consider increasing max_standby_streaming_delay to avoid rapid cancellations by newly-arrived streaming WAL entries after reconnection.
Another option is to increase vacuum_defer_cleanup_age on the primary server, so that dead rows will not be cleaned up as quickly as they normally would be. This will allow more time for queries to execute before they are canceled on the standby, without having to set a high max_standby_streaming_delay. However it is difficult to guarantee any specific execution-time window with this approach, since vacuum_defer_cleanup_age is measured in transactions executed on the primary server.
The number of query cancels and the reason for them can be viewed using the pg_stat_database_conflicts system view on the standby server. The pg_stat_database system view also contains summary information.
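For example, on the standby the per-database conflict counts can be inspected with a query like this sketch:

SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;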
Users can control whether a log message is produced when WAL replay is waiting longer than deadlock_timeout for conflicts. This is controlled by the log_recovery_conflict_waits parameter.
If hot_standby is on in postgresql.conf (the default value) and there is a standby.signal file present, the server will run in Hot Standby mode. However, it may take some time for Hot Standby connections to be allowed, because the server will not accept connections until it has completed sufficient recovery to provide a consistent state against which queries can run. During this period, clients that attempt to connect will be refused with an error message. To confirm the server has come up, either loop trying to connect from the application, or look for these messages in the server logs:

LOG:  entering standby mode

... then some time later ...

LOG:  consistent recovery state reached
LOG:  database system is ready to accept read-only connections
Consistency information is recorded once per checkpoint on the primary.
It is not possible to enable hot standby when reading WAL written during a period when wal_level was not set to replica or logical on the primary. Reaching a consistent state can also be delayed in the presence of both of these conditions:
A write transaction has more than 64 subtransactions
Very long-lived write transactions
If you are running file-based log shipping ("warm standby"), you might need to wait until the next WAL file arrives, which could be as long as the archive_timeout setting on the primary.
The settings of some parameters determine the size of shared memory for tracking transaction IDs, locks, and prepared transactions. These shared memory structures must be no smaller on a standby than on the primary in order to ensure that the standby does not run out of shared memory during recovery. For example, if the primary had used a prepared transaction but the standby had not allocated any shared memory for tracking prepared transactions, then recovery could not continue until the standby's configuration is changed. The parameters affected are:
max_connections
max_prepared_transactions
max_locks_per_transaction
max_wal_senders
max_worker_processes
The easiest way to ensure this does not become a problem is to have these parameters set on the standbys to values equal to or greater than on the primary. Therefore, if you want to increase these values, you should do so on all standby servers first, before applying the changes to the primary server. Conversely, if you want to decrease these values, you should do so on the primary server first, before applying the changes to all standby servers. Keep in mind that when a standby is promoted, it becomes the new reference for the required parameter settings for the standbys that follow it. Therefore, to avoid this becoming a problem during a switchover or failover, it is recommended to keep these settings the same on all standby servers.
The WAL tracks changes to these parameters on the primary. If a hot standby processes WAL that indicates that the current value on the primary is higher than its own value, it will log a warning and pause recovery, for example:
WARNING:  hot standby is not possible because of insufficient parameter settings
DETAIL:  max_connections = 80 is a lower setting than on the primary server, where its value was 100.
LOG:  recovery has paused
DETAIL:  If recovery is unpaused, the server will shut down.
HINT:  You can then restart the server after making the necessary configuration changes.
At that point, the settings on the standby need to be updated and the instance restarted before recovery can continue. If the standby is not a hot standby, then when it encounters the incompatible parameter change, it will shut down immediately without pausing, since there is then no value in keeping it up.
It is important that the administrator select appropriate settings for max_standby_archive_delay and max_standby_streaming_delay. The best choices vary depending on business priorities. For example if the server is primarily tasked as a High Availability server, then you will want low delay settings, perhaps even zero, though that is a very aggressive setting. If the standby server is tasked as an additional server for decision support queries then it might be acceptable to set the maximum delay values to many hours, or even -1 which means wait forever for queries to complete.
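As a hedged sketch of these trade-offs, a standby dedicated to long reporting queries might use settings such as the following in postgresql.conf (the values are illustrative only):

# Allow conflicting queries up to 30 minutes before WAL application cancels them;
# -1 would mean wait forever.
max_standby_archive_delay = 30min
max_standby_streaming_delay = 30min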
Transaction status "hint bits" written on the primary are not WAL-logged, so data on the standby will likely re-write the hints again on the standby. Thus, the standby server will still perform disk writes even though all users are read-only; no changes occur to the data values themselves. Users will still write large sort temporary files and re-generate relcache info files, so no part of the database is truly read-only during hot standby mode. Note also that writes to remote databases using the dblink module, and other operations outside the database using PL functions, will still be possible, even though the transaction is read-only locally.
The following types of administration commands are not accepted during recovery mode:
Data Definition Language (DDL): e.g., CREATE INDEX
Privilege and Ownership: GRANT, REVOKE, REASSIGN
Maintenance commands: ANALYZE, VACUUM, CLUSTER, REINDEX
Again, note that some of these commands are actually allowed during "read only" mode transactions on the primary.
As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby.
pg_cancel_backend() and pg_terminate_backend() will work on user backends, but not the Startup process, which performs recovery. pg_stat_activity does not show recovering transactions as active. As a result, pg_prepared_xacts is always empty during recovery. If you wish to resolve in-doubt prepared transactions, view pg_prepared_xacts on the primary and issue commands to resolve transactions there or resolve them after the end of recovery.
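For example, the in-doubt transactions can be listed on the primary and then resolved there; the GID used below is hypothetical:

SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

-- Resolve one of them:
COMMIT PREPARED 'some_gid';
-- or: ROLLBACK PREPARED 'some_gid';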
pg_locks will show locks held by backends, as normal. pg_locks also shows a virtual transaction managed by the Startup process that owns all AccessExclusiveLocks held by transactions being replayed by recovery. Note that the Startup process does not acquire locks to make database changes, and thus locks other than AccessExclusiveLocks do not show in pg_locks for the Startup process; they are just presumed to exist.
The Nagios plugin check_pgsql will work, because the simple information it checks for exists. The check_postgres monitoring script will also work, though some reported values could give different or confusing results. For example, last vacuum time will not be maintained, since no vacuum occurs on the standby. Vacuums running on the primary do still send their changes to the standby.
WAL file control commands will not work during recovery, e.g., pg_start_backup, pg_switch_wal etc. Dynamically loadable modules work, including pg_stat_statements.
Advisory locks work normally in recovery, including deadlock detection. Note that advisory locks are never WAL logged, so it is impossible for an advisory lock on either the primary or the standby to conflict with WAL replay. Nor is it possible to acquire an advisory lock on the primary and have it initiate a similar advisory lock on the standby. Advisory locks relate only to the server on which they are acquired.
Trigger-based replication systems such as Slony, Londiste and Bucardo won't run on the standby at all, though they will run happily on the primary server as long as the changes are not sent to standby servers to be applied. WAL replay is not trigger-based so you cannot relay from the standby to any system that requires additional database writes or relies on the use of triggers.
New OIDs cannot be assigned, though some UUID generators may still work as long as they do not rely on writing new status to the database.
Currently, temporary table creation is not allowed during read-only transactions, so in some cases existing scripts will not run correctly. This restriction might be relaxed in a later release. This is both an SQL standard compliance issue and a technical issue.
DROP TABLESPACE can only succeed if the tablespace is empty. Some standby users may be actively using the tablespace via their temp_tablespaces parameter. If there are temporary files in the tablespace, all active queries are canceled to ensure that temporary files are removed, so the tablespace can be removed and WAL replay can continue.
Running DROP DATABASE or ALTER DATABASE ... SET TABLESPACE on the primary will generate a WAL entry that will cause all users connected to that database on the standby to be forcibly disconnected. This action occurs immediately, whatever the setting of max_standby_streaming_delay. Note that ALTER DATABASE ... RENAME does not disconnect users, which in most cases will go unnoticed, though it might in some cases confuse a program that depends in some way upon the database name.
In normal (non-recovery) mode, if you issue DROP USER or DROP ROLE for a role with login capability while that user is still connected then nothing happens to the connected user — they remain connected. The user cannot reconnect however. This behavior applies in recovery also, so a DROP USER on the primary does not disconnect that user on the standby.
The statistics collector is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. Replayed actions will not duplicate their effects on the primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from the primary and standby will differ; this is considered a feature, not a bug.
Autovacuum is not active during recovery. It will start normally at the end of recovery.
The checkpointer process and the background writer process are active during
recovery. The checkpointer process will perform restartpoints (similar to
checkpoints on the primary) and the background writer process will perform
normal block cleaning activities. This can include updates of the hint bit
information stored on the standby server.
The CHECKPOINT command is accepted during recovery, though it performs a restartpoint rather than a new checkpoint.
Various parameters have been mentioned above in Section 27.4.2 and Section 27.4.3.
On the primary, parameters wal_level and vacuum_defer_cleanup_age can be used. max_standby_archive_delay and max_standby_streaming_delay have no effect if set on the primary.
On the standby, parameters hot_standby, max_standby_archive_delay and max_standby_streaming_delay can be used. vacuum_defer_cleanup_age has no effect as long as the server remains in standby mode, though it will become relevant if the standby becomes primary.
There are several limitations of Hot Standby. These can and probably will be fixed in future releases:
Full knowledge of running transactions is required before snapshots can be taken. Transactions that use large numbers of subtransactions (currently greater than 64) will delay the start of read-only connections until the completion of the longest running write transaction. If this situation occurs, explanatory messages will be sent to the server log.
Valid starting points for standby queries are generated at each checkpoint on the primary. If the standby is shut down while the primary is in a shutdown state, it might not be possible to re-enter Hot Standby until the primary is started up, so that it generates further starting points in the WAL logs. This situation isn't a problem in the most common situations where it might happen. Generally, if the primary is shut down and not available anymore, that's likely due to a serious failure that requires the standby being converted to operate as the new primary anyway. And in situations where the primary is being intentionally taken down, coordinating to make sure the standby becomes the new primary smoothly is also standard procedure.
At the end of recovery, AccessExclusiveLocks held by prepared transactions will require twice the normal number of lock table entries. If you plan on running either a large number of concurrent prepared transactions that normally take AccessExclusiveLocks, or you plan on having one large transaction that takes many AccessExclusiveLocks, you are advised to select a larger value of max_locks_per_transaction, perhaps as much as twice the value of the parameter on the primary server. You need not consider this at all if your setting of max_prepared_transactions is 0.
The Serializable transaction isolation level is not yet available in hot standby. (See Section 13.2.3 and Section 13.4.1 for details.) An attempt to set a transaction to the serializable isolation level in hot standby mode will generate an error.
Table of Contents
pg_stat_activity
pg_stat_replication
pg_stat_replication_slots
pg_stat_wal_receiver
pg_stat_subscription
pg_stat_ssl
pg_stat_gssapi
pg_stat_archiver
pg_stat_bgwriter
pg_stat_wal
pg_stat_database
pg_stat_database_conflicts
pg_stat_all_tables
pg_stat_all_indexes
pg_statio_all_tables
pg_statio_all_indexes
pg_statio_all_sequences
pg_stat_user_functions
pg_stat_slru
A database administrator frequently wonders, “What is the system doing right now?” This chapter discusses how to find that out.
Several tools are available for monitoring database activity and analyzing performance. Most of this chapter is devoted to describing PostgreSQL's statistics collector, but one should not neglect regular Unix monitoring programs such as ps, top, iostat, and vmstat. Also, once one has identified a poorly-performing query, further investigation might be needed using PostgreSQL's EXPLAIN command. Section 14.1 discusses EXPLAIN and other methods for understanding the behavior of an individual query.
On most Unix platforms, PostgreSQL modifies its command title as reported by ps, so that individual server processes can readily be identified. A sample display is

$ ps auxww | grep ^postgres
postgres  15551  0.0  0.1  57536  7132 pts/0    S     18:02   0:00 postgres -i
postgres  15554  0.0  0.0  57536  1184 ?        Ss    18:02   0:00 postgres: background writer
postgres  15555  0.0  0.0  57536   916 ?        Ss    18:02   0:00 postgres: checkpointer
postgres  15556  0.0  0.0  57536   916 ?        Ss    18:02   0:00 postgres: walwriter
postgres  15557  0.0  0.0  58504  2244 ?        Ss    18:02   0:00 postgres: autovacuum launcher
postgres  15558  0.0  0.0  17512  1068 ?        Ss    18:02   0:00 postgres: stats collector
postgres  15582  0.0  0.0  58772  3080 ?        Ss    18:04   0:00 postgres: joe runbug 127.0.0.1 idle
postgres  15606  0.0  0.0  58772  3052 ?        Ss    18:07   0:00 postgres: tgl regression [local] SELECT waiting
postgres  15610  0.0  0.0  58772  3056 ?        Ss    18:07   0:00 postgres: tgl regression [local] idle in transaction
(The appropriate invocation of ps varies across different platforms, as do the details of what is shown. This example is from a recent Linux system.) The first process listed here is the primary server process. The command arguments shown for it are the same ones used when it was launched. The next five processes are background worker processes automatically launched by the primary process. (The “stats collector” process will not be present if you have set the system not to start the statistics collector; likewise the “autovacuum launcher” process can be disabled.) Each of the remaining processes is a server process handling one client connection. Each such process sets its command line display in the form

postgres: user database host activity
The user, database, and (client) host items remain the same for the life of the client connection, but the activity indicator changes. The activity can be idle (i.e., waiting for a client command), idle in transaction (waiting for client inside a BEGIN block), or a command type name such as SELECT. Also, waiting is appended if the server process is presently waiting on a lock held by another session. In the above example we can infer that process 15606 is waiting for process 15610 to complete its transaction and thereby release some lock. (Process 15610 must be the blocker, because there is no other active session. In more complicated cases it would be necessary to look into the pg_locks system view to determine who is blocking whom.)
If cluster_name has been configured the cluster name will also be shown in ps output:

$ psql -c 'SHOW cluster_name'
 cluster_name
--------------
 server1
(1 row)

$ ps aux|grep server1
postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: server1: background writer
...
If you have turned off update_process_title then the activity indicator is not updated; the process title is set only once when a new process is launched. On some platforms this saves a measurable amount of per-command overhead; on others it's insignificant.
Solaris requires special handling. You must use /usr/ucb/ps, rather than /bin/ps. You also must use two w flags, not just one. In addition, your original invocation of the postgres command must have a shorter ps status display than that provided by each server process. If you fail to do all three things, the ps output for each server process will be the original postgres command line.
pg_stat_activity
pg_stat_replication
pg_stat_replication_slots
pg_stat_wal_receiver
pg_stat_subscription
pg_stat_ssl
pg_stat_gssapi
pg_stat_archiver
pg_stat_bgwriter
pg_stat_wal
pg_stat_database
pg_stat_database_conflicts
pg_stat_all_tables
pg_stat_all_indexes
pg_statio_all_tables
pg_statio_all_indexes
pg_statio_all_sequences
pg_stat_user_functions
pg_stat_slru
PostgreSQL's statistics collector is a subsystem that supports collection and reporting of information about server activity. Presently, the collector can count accesses to tables and indexes in both disk-block and individual-row terms. It also tracks the total number of rows in each table, and information about vacuum and analyze actions for each table. It can also count calls to user-defined functions and the total time spent in each one.
PostgreSQL also supports reporting dynamic information about exactly what is going on in the system right now, such as the exact command currently being executed by other server processes, and which other connections exist in the system. This facility is independent of the collector process.
Since collection of statistics adds some overhead to query execution, the system can be configured to collect or not collect information. This is controlled by configuration parameters that are normally set in postgresql.conf. (See Chapter 20 for details about setting configuration parameters.)
The parameter track_activities enables monitoring of the current command being executed by any server process.
The parameter track_counts controls whether statistics are collected about table and index accesses.
The parameter track_functions enables tracking of usage of user-defined functions.
The parameter track_io_timing enables monitoring of block read and write times.
The parameter track_wal_io_timing enables monitoring of WAL write times.
Normally these parameters are set in postgresql.conf so that they apply to all server processes, but it is possible to turn them on or off in individual sessions using the SET command. (To prevent ordinary users from hiding their activity from the administrator, only superusers are allowed to change these parameters with SET.)
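For instance, a superuser could enable function-call tracking for the current session only; this is a sketch:

SET track_functions = 'all';   -- or 'pl' to track only procedural-language functions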
The statistics collector transmits the collected information to other PostgreSQL processes through temporary files. These files are stored in the directory named by the stats_temp_directory parameter, pg_stat_tmp by default. For better performance, stats_temp_directory can be pointed at a RAM-based file system, decreasing physical I/O requirements. When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start (e.g., after immediate shutdown, server crash, and point-in-time recovery), all statistics counters are reset.
Several predefined views, listed in Table 28.1, are available to show the current state of the system. There are also several other views, listed in Table 28.2, available to show the results of statistics collection. Alternatively, one can build custom views using the underlying statistics functions, as discussed in Section 28.2.22.
When using the statistics to monitor collected data, it is important to realize that the information does not update instantaneously. Each individual server process transmits new statistical counts to the collector just before going idle; so a query or transaction still in progress does not affect the displayed totals. Also, the collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500 ms unless altered while building the server). So the displayed information lags behind actual activity. However, current-query information collected by track_activities is always up-to-date.
Another important point is that when a server process is asked to display
any of these statistics, it first fetches the most recent report emitted by
the collector process and then continues to use this snapshot for all
statistical views and functions until the end of its current transaction.
So the statistics will show static information as long as you continue the
current transaction. Similarly, information about the current queries of
all sessions is collected when any such information is first requested
within a transaction, and the same information will be displayed throughout
the transaction.
This is a feature, not a bug, because it allows you to perform several queries on the statistics and correlate the results without worrying that the numbers are changing underneath you. But if you want to see new results with each query, be sure to do the queries outside any transaction block. Alternatively, you can invoke pg_stat_clear_snapshot(), which will discard the current transaction's statistics snapshot (if any). The next use of statistical information will cause a new snapshot to be fetched.
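For example, a transaction can read a counter, discard its snapshot, and then read fresh numbers; the table name below is hypothetical:

BEGIN;
SELECT n_tup_ins FROM pg_stat_user_tables WHERE relname = 'my_table';  -- snapshot taken here
SELECT pg_stat_clear_snapshot();                                       -- discard that snapshot
SELECT n_tup_ins FROM pg_stat_user_tables WHERE relname = 'my_table';  -- fresh numbers
COMMIT;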
A transaction can also see its own statistics (as yet untransmitted to the collector) in the views pg_stat_xact_all_tables, pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and pg_stat_xact_user_functions. These numbers do not act as stated above; instead they update continuously throughout the transaction.
Some of the information in the dynamic statistics views shown in Table 28.1 is security restricted. Ordinary users can only see all the information about their own sessions (sessions belonging to a role that they are a member of). In rows about other sessions, many columns will be null. Note, however, that the existence of a session and its general properties such as its session user and database are visible to all users. Superusers and members of the built-in role pg_read_all_stats (see also Section 22.5) can see all the information about all sessions.
Table 28.1. Dynamic Statistics Views
View Name | Description
---|---
pg_stat_activity | One row per server process, showing information related to the current activity of that process, such as state and current query. See pg_stat_activity for details.
pg_stat_replication | One row per WAL sender process, showing statistics about replication to that sender's connected standby server. See pg_stat_replication for details.
pg_stat_wal_receiver | Only one row, showing statistics about the WAL receiver from that receiver's connected server. See pg_stat_wal_receiver for details.
pg_stat_subscription | At least one row per subscription, showing information about the subscription workers. See pg_stat_subscription for details.
pg_stat_ssl | One row per connection (regular and replication), showing information about SSL used on this connection. See pg_stat_ssl for details.
pg_stat_gssapi | One row per connection (regular and replication), showing information about GSSAPI authentication and encryption used on this connection. See pg_stat_gssapi for details.
pg_stat_progress_analyze | One row for each backend (including autovacuum worker processes) running ANALYZE, showing current progress. See Section 28.4.1.
pg_stat_progress_create_index | One row for each backend running CREATE INDEX or REINDEX, showing current progress. See Section 28.4.2.
pg_stat_progress_vacuum | One row for each backend (including autovacuum worker processes) running VACUUM, showing current progress. See Section 28.4.3.
pg_stat_progress_cluster | One row for each backend running CLUSTER or VACUUM FULL, showing current progress. See Section 28.4.4.
pg_stat_progress_basebackup | One row for each WAL sender process streaming a base backup, showing current progress. See Section 28.4.5.
pg_stat_progress_copy | One row for each backend running COPY, showing current progress. See Section 28.4.6.
Table 28.2. Collected Statistics Views
View Name | Description
---|---
pg_stat_archiver | One row only, showing statistics about the WAL archiver process's activity. See pg_stat_archiver for details.
pg_stat_bgwriter | One row only, showing statistics about the background writer process's activity. See pg_stat_bgwriter for details.
pg_stat_wal | One row only, showing statistics about WAL activity. See pg_stat_wal for details.
pg_stat_database | One row per database, showing database-wide statistics. See pg_stat_database for details.
pg_stat_database_conflicts | One row per database, showing database-wide statistics about query cancels due to conflict with recovery on standby servers. See pg_stat_database_conflicts for details.
pg_stat_all_tables | One row for each table in the current database, showing statistics about accesses to that specific table. See pg_stat_all_tables for details.
pg_stat_sys_tables | Same as pg_stat_all_tables, except that only system tables are shown.
pg_stat_user_tables | Same as pg_stat_all_tables, except that only user tables are shown.
pg_stat_xact_all_tables | Similar to pg_stat_all_tables, but counts actions taken so far within the current transaction (which are not yet included in pg_stat_all_tables and related views). The columns for numbers of live and dead rows and vacuum and analyze actions are not present in this view.
pg_stat_xact_sys_tables | Same as pg_stat_xact_all_tables, except that only system tables are shown.
pg_stat_xact_user_tables | Same as pg_stat_xact_all_tables, except that only user tables are shown.
pg_stat_all_indexes | One row for each index in the current database, showing statistics about accesses to that specific index. See pg_stat_all_indexes for details.
pg_stat_sys_indexes | Same as pg_stat_all_indexes, except that only indexes on system tables are shown.
pg_stat_user_indexes | Same as pg_stat_all_indexes, except that only indexes on user tables are shown.
pg_statio_all_tables | One row for each table in the current database, showing statistics about I/O on that specific table. See pg_statio_all_tables for details.
pg_statio_sys_tables | Same as pg_statio_all_tables, except that only system tables are shown.
pg_statio_user_tables | Same as pg_statio_all_tables, except that only user tables are shown.
pg_statio_all_indexes | One row for each index in the current database, showing statistics about I/O on that specific index. See pg_statio_all_indexes for details.
pg_statio_sys_indexes | Same as pg_statio_all_indexes, except that only indexes on system tables are shown.
pg_statio_user_indexes | Same as pg_statio_all_indexes, except that only indexes on user tables are shown.
pg_statio_all_sequences | One row for each sequence in the current database, showing statistics about I/O on that specific sequence. See pg_statio_all_sequences for details.
pg_statio_sys_sequences | Same as pg_statio_all_sequences, except that only system sequences are shown. (Presently, no system sequences are defined, so this view is always empty.)
pg_statio_user_sequences | Same as pg_statio_all_sequences, except that only user sequences are shown.
pg_stat_user_functions | One row for each tracked function, showing statistics about executions of that function. See pg_stat_user_functions for details.
pg_stat_xact_user_functions | Similar to pg_stat_user_functions, but counts only calls during the current transaction (which are not yet included in pg_stat_user_functions).
pg_stat_slru | One row per SLRU, showing statistics of operations. See pg_stat_slru for details.
pg_stat_replication_slots | One row per replication slot, showing statistics about the replication slot's usage. See pg_stat_replication_slots for details.
pg_stat_replication_slots for details.
|
The per-index statistics are particularly useful to determine which indexes are being used and how effective they are.
The pg_statio_
views are primarily useful to
determine the effectiveness of the buffer cache. When the number
of actual disk reads is much smaller than the number of buffer
hits, then the cache is satisfying most read requests without
invoking a kernel call. However, these statistics do not give the
entire story: due to the way in which PostgreSQL
handles disk I/O, data that is not in the
PostgreSQL buffer cache might still reside in the
kernel's I/O cache, and might therefore still be fetched without
requiring a physical read. Users interested in obtaining more
detailed information on PostgreSQL I/O behavior are
advised to use the PostgreSQL statistics collector
in combination with operating system utilities that allow insight
into the kernel's handling of I/O.
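As an illustration, a rough per-table buffer cache hit ratio can be derived from the pg_statio_user_tables counters, bearing in mind the kernel-cache caveat above (query for illustration only):
SELECT relname, heap_blks_read, heap_blks_hit,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
    FROM pg_statio_user_tables
    ORDER BY heap_blks_hit + heap_blks_read DESC
    LIMIT 10;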
pg_stat_activity
The pg_stat_activity
view will have one row
per server process, showing information related to
the current activity of that process.
Table 28.3. pg_stat_activity
View
Column Type Description |
---|
OID of the database this backend is connected to |
Name of the database this backend is connected to |
Process ID of this backend |
Process ID of the parallel group leader, if this process is a
parallel query worker. |
OID of the user logged into this backend |
Name of the user logged into this backend |
Name of the application that is connected to this backend |
IP address of the client connected to this backend. If this field is null, it indicates either that the client is connected via a Unix socket on the server machine or that this is an internal process such as autovacuum. |
Host name of the connected client, as reported by a
reverse DNS lookup of |
TCP port number that the client is using for communication
with this backend, or |
Time when this process was started. For client backends, this is the time the client connected to the server. |
Time when this process' current transaction was started, or null
if no transaction is active. If the current
query is the first of its transaction, this column is equal to the
|
Time when the currently active query was started, or if
|
Time when the |
The type of event for which the backend is waiting, if any; otherwise NULL. See Table 28.4. |
Wait event name if backend is currently waiting, otherwise NULL. See Table 28.5 through Table 28.13. |
Current overall state of this backend. Possible values are:
|
Top-level transaction identifier of this backend, if any. |
The current backend's |
Identifier of this backend's most recent query. If
|
Text of this backend's most recent query. If
|
Type of current backend. Possible types are
|
The wait_event and state columns are independent. If a backend is in
the active state, it may or may not be waiting on some event. If the
state is active and wait_event is non-null, it means that a query is
being executed, but is being blocked somewhere in the system.
Table 28.4. Wait Event Types
Wait Event Type | Description |
---|---|
Activity | The server process is idle. This event type indicates a process
waiting for activity in its main processing loop.
wait_event will identify the specific wait point;
see Table 28.5.
|
BufferPin | The server process is waiting for exclusive access to a data buffer. Buffer pin waits can be protracted if another process holds an open cursor that last read data from the buffer in question. See Table 28.6. |
Client | The server process is waiting for activity on a socket
connected to a user application. Thus, the server expects something
to happen that is independent of its internal processes.
wait_event will identify the specific wait point;
see Table 28.7.
|
Extension | The server process is waiting for some condition defined by an extension module. See Table 28.8. |
IO | The server process is waiting for an I/O operation to complete.
wait_event will identify the specific wait point;
see Table 28.9.
|
IPC | The server process is waiting for some interaction with
another server process. wait_event will
identify the specific wait point;
see Table 28.10.
|
Lock | The server process is waiting for a heavyweight lock.
Heavyweight locks, also known as lock manager locks or simply locks,
primarily protect SQL-visible objects such as tables. However,
they are also used to ensure mutual exclusion for certain internal
operations such as relation extension. wait_event
will identify the type of lock awaited;
see Table 28.11.
|
LWLock | The server process is waiting for a lightweight lock.
Most such locks protect a particular data structure in shared memory.
wait_event will contain a name identifying the purpose
of the lightweight lock. (Some locks have specific names; others
are part of a group of locks each with a similar purpose.)
See Table 28.12.
|
Timeout | The server process is waiting for a timeout
to expire. wait_event will identify the specific wait
point; see Table 28.13.
|
Table 28.5. Wait Events of Type Activity
Activity Wait Event | Description |
---|---|
ArchiverMain | Waiting in main loop of archiver process. |
AutoVacuumMain | Waiting in main loop of autovacuum launcher process. |
BgWriterHibernate | Waiting in background writer process, hibernating. |
BgWriterMain | Waiting in main loop of background writer process. |
CheckpointerMain | Waiting in main loop of checkpointer process. |
LogicalApplyMain | Waiting in main loop of logical replication apply process. |
LogicalLauncherMain | Waiting in main loop of logical replication launcher process. |
PgStatMain | Waiting in main loop of statistics collector process. |
RecoveryWalStream | Waiting in main loop of startup process for WAL to arrive, during streaming recovery. |
SysLoggerMain | Waiting in main loop of syslogger process. |
WalReceiverMain | Waiting in main loop of WAL receiver process. |
WalSenderMain | Waiting in main loop of WAL sender process. |
WalWriterMain | Waiting in main loop of WAL writer process. |
Table 28.6. Wait Events of Type BufferPin
BufferPin Wait Event | Description |
---|---|
BufferPin | Waiting to acquire an exclusive pin on a buffer. |
Table 28.7. Wait Events of Type Client
Client Wait Event | Description |
---|---|
ClientRead | Waiting to read data from the client. |
ClientWrite | Waiting to write data to the client. |
GSSOpenServer | Waiting to read data from the client while establishing a GSSAPI session. |
LibPQWalReceiverConnect | Waiting in WAL receiver to establish connection to remote server. |
LibPQWalReceiverReceive | Waiting in WAL receiver to receive data from remote server. |
SSLOpenServer | Waiting for SSL while attempting connection. |
WalSenderWaitForWAL | Waiting for WAL to be flushed in WAL sender process. |
WalSenderWriteData | Waiting for any activity when processing replies from WAL receiver in WAL sender process. |
Table 28.8. Wait Events of Type Extension
Extension Wait Event | Description |
---|---|
Extension | Waiting in an extension. |
Table 28.9. Wait Events of Type IO
IO Wait Event | Description |
---|---|
BaseBackupRead | Waiting for base backup to read from a file. |
BufFileRead | Waiting for a read from a buffered file. |
BufFileWrite | Waiting for a write to a buffered file. |
BufFileTruncate | Waiting for a buffered file to be truncated. |
ControlFileRead | Waiting for a read from the pg_control
file. |
ControlFileSync | Waiting for the pg_control file to reach
durable storage. |
ControlFileSyncUpdate | Waiting for an update to the pg_control file
to reach durable storage. |
ControlFileWrite | Waiting for a write to the pg_control
file. |
ControlFileWriteUpdate | Waiting for a write to update the pg_control
file. |
CopyFileRead | Waiting for a read during a file copy operation. |
CopyFileWrite | Waiting for a write during a file copy operation. |
DSMFillZeroWrite | Waiting to fill a dynamic shared memory backing file with zeroes. |
DataFileExtend | Waiting for a relation data file to be extended. |
DataFileFlush | Waiting for a relation data file to reach durable storage. |
DataFileImmediateSync | Waiting for an immediate synchronization of a relation data file to durable storage. |
DataFilePrefetch | Waiting for an asynchronous prefetch from a relation data file. |
DataFileRead | Waiting for a read from a relation data file. |
DataFileSync | Waiting for changes to a relation data file to reach durable storage. |
DataFileTruncate | Waiting for a relation data file to be truncated. |
DataFileWrite | Waiting for a write to a relation data file. |
LockFileAddToDataDirRead | Waiting for a read while adding a line to the data directory lock file. |
LockFileAddToDataDirSync | Waiting for data to reach durable storage while adding a line to the data directory lock file. |
LockFileAddToDataDirWrite | Waiting for a write while adding a line to the data directory lock file. |
LockFileCreateRead | Waiting to read while creating the data directory lock file. |
LockFileCreateSync | Waiting for data to reach durable storage while creating the data directory lock file. |
LockFileCreateWrite | Waiting for a write while creating the data directory lock file. |
LockFileReCheckDataDirRead | Waiting for a read during recheck of the data directory lock file. |
LogicalRewriteCheckpointSync | Waiting for logical rewrite mappings to reach durable storage during a checkpoint. |
LogicalRewriteMappingSync | Waiting for mapping data to reach durable storage during a logical rewrite. |
LogicalRewriteMappingWrite | Waiting for a write of mapping data during a logical rewrite. |
LogicalRewriteSync | Waiting for logical rewrite mappings to reach durable storage. |
LogicalRewriteTruncate | Waiting for truncate of mapping data during a logical rewrite. |
LogicalRewriteWrite | Waiting for a write of logical rewrite mappings. |
RelationMapRead | Waiting for a read of the relation map file. |
RelationMapSync | Waiting for the relation map file to reach durable storage. |
RelationMapWrite | Waiting for a write to the relation map file. |
ReorderBufferRead | Waiting for a read during reorder buffer management. |
ReorderBufferWrite | Waiting for a write during reorder buffer management. |
ReorderLogicalMappingRead | Waiting for a read of a logical mapping during reorder buffer management. |
ReplicationSlotRead | Waiting for a read from a replication slot control file. |
ReplicationSlotRestoreSync | Waiting for a replication slot control file to reach durable storage while restoring it to memory. |
ReplicationSlotSync | Waiting for a replication slot control file to reach durable storage. |
ReplicationSlotWrite | Waiting for a write to a replication slot control file. |
SLRUFlushSync | Waiting for SLRU data to reach durable storage during a checkpoint or database shutdown. |
SLRURead | Waiting for a read of an SLRU page. |
SLRUSync | Waiting for SLRU data to reach durable storage following a page write. |
SLRUWrite | Waiting for a write of an SLRU page. |
SnapbuildRead | Waiting for a read of a serialized historical catalog snapshot. |
SnapbuildSync | Waiting for a serialized historical catalog snapshot to reach durable storage. |
SnapbuildWrite | Waiting for a write of a serialized historical catalog snapshot. |
TimelineHistoryFileSync | Waiting for a timeline history file received via streaming replication to reach durable storage. |
TimelineHistoryFileWrite | Waiting for a write of a timeline history file received via streaming replication. |
TimelineHistoryRead | Waiting for a read of a timeline history file. |
TimelineHistorySync | Waiting for a newly created timeline history file to reach durable storage. |
TimelineHistoryWrite | Waiting for a write of a newly created timeline history file. |
TwophaseFileRead | Waiting for a read of a two phase state file. |
TwophaseFileSync | Waiting for a two phase state file to reach durable storage. |
TwophaseFileWrite | Waiting for a write of a two phase state file. |
WALBootstrapSync | Waiting for WAL to reach durable storage during bootstrapping. |
WALBootstrapWrite | Waiting for a write of a WAL page during bootstrapping. |
WALCopyRead | Waiting for a read when creating a new WAL segment by copying an existing one. |
WALCopySync | Waiting for a new WAL segment created by copying an existing one to reach durable storage. |
WALCopyWrite | Waiting for a write when creating a new WAL segment by copying an existing one. |
WALInitSync | Waiting for a newly initialized WAL file to reach durable storage. |
WALInitWrite | Waiting for a write while initializing a new WAL file. |
WALRead | Waiting for a read from a WAL file. |
WALSenderTimelineHistoryRead | Waiting for a read from a timeline history file during a walsender timeline command. |
WALSync | Waiting for a WAL file to reach durable storage. |
WALSyncMethodAssign | Waiting for data to reach durable storage while assigning a new WAL sync method. |
WALWrite | Waiting for a write to a WAL file. |
LogicalChangesRead | Waiting for a read from a logical changes file. |
LogicalChangesWrite | Waiting for a write to a logical changes file. |
LogicalSubxactRead | Waiting for a read from a logical subxact file. |
LogicalSubxactWrite | Waiting for a write to a logical subxact file. |
Table 28.10. Wait Events of Type IPC
IPC Wait Event | Description |
---|---|
AppendReady | Waiting for subplan nodes of an Append plan
node to be ready. |
BackendTermination | Waiting for the termination of another backend. |
BackupWaitWalArchive | Waiting for WAL files required for a backup to be successfully archived. |
BgWorkerShutdown | Waiting for background worker to shut down. |
BgWorkerStartup | Waiting for background worker to start up. |
BtreePage | Waiting for the page number needed to continue a parallel B-tree scan to become available. |
BufferIO | Waiting for buffer I/O to complete. |
CheckpointDone | Waiting for a checkpoint to complete. |
CheckpointStart | Waiting for a checkpoint to start. |
ExecuteGather | Waiting for activity from a child process while
executing a Gather plan node. |
HashBatchAllocate | Waiting for an elected Parallel Hash participant to allocate a hash table. |
HashBatchElect | Waiting to elect a Parallel Hash participant to allocate a hash table. |
HashBatchLoad | Waiting for other Parallel Hash participants to finish loading a hash table. |
HashBuildAllocate | Waiting for an elected Parallel Hash participant to allocate the initial hash table. |
HashBuildElect | Waiting to elect a Parallel Hash participant to allocate the initial hash table. |
HashBuildHashInner | Waiting for other Parallel Hash participants to finish hashing the inner relation. |
HashBuildHashOuter | Waiting for other Parallel Hash participants to finish partitioning the outer relation. |
HashGrowBatchesAllocate | Waiting for an elected Parallel Hash participant to allocate more batches. |
HashGrowBatchesDecide | Waiting to elect a Parallel Hash participant to decide on future batch growth. |
HashGrowBatchesElect | Waiting to elect a Parallel Hash participant to allocate more batches. |
HashGrowBatchesFinish | Waiting for an elected Parallel Hash participant to decide on future batch growth. |
HashGrowBatchesRepartition | Waiting for other Parallel Hash participants to finish repartitioning. |
HashGrowBucketsAllocate | Waiting for an elected Parallel Hash participant to finish allocating more buckets. |
HashGrowBucketsElect | Waiting to elect a Parallel Hash participant to allocate more buckets. |
HashGrowBucketsReinsert | Waiting for other Parallel Hash participants to finish inserting tuples into new buckets. |
LogicalSyncData | Waiting for a logical replication remote server to send data for initial table synchronization. |
LogicalSyncStateChange | Waiting for a logical replication remote server to change state. |
MessageQueueInternal | Waiting for another process to be attached to a shared message queue. |
MessageQueuePutMessage | Waiting to write a protocol message to a shared message queue. |
MessageQueueReceive | Waiting to receive bytes from a shared message queue. |
MessageQueueSend | Waiting to send bytes to a shared message queue. |
ParallelBitmapScan | Waiting for parallel bitmap scan to become initialized. |
ParallelCreateIndexScan | Waiting for parallel CREATE INDEX workers to
finish heap scan. |
ParallelFinish | Waiting for parallel workers to finish computing. |
ProcArrayGroupUpdate | Waiting for the group leader to clear the transaction ID at end of a parallel operation. |
ProcSignalBarrier | Waiting for a barrier event to be processed by all backends. |
Promote | Waiting for standby promotion. |
RecoveryConflictSnapshot | Waiting for recovery conflict resolution for a vacuum cleanup. |
RecoveryConflictTablespace | Waiting for recovery conflict resolution for dropping a tablespace. |
RecoveryPause | Waiting for recovery to be resumed. |
ReplicationOriginDrop | Waiting for a replication origin to become inactive so it can be dropped. |
ReplicationSlotDrop | Waiting for a replication slot to become inactive so it can be dropped. |
SafeSnapshot | Waiting to obtain a valid snapshot for a READ ONLY
DEFERRABLE transaction. |
SyncRep | Waiting for confirmation from a remote server during synchronous replication. |
WalReceiverExit | Waiting for the WAL receiver to exit. |
WalReceiverWaitStart | Waiting for startup process to send initial data for streaming replication. |
XactGroupUpdate | Waiting for the group leader to update transaction status at end of a parallel operation. |
Table 28.11. Wait Events of Type Lock
Lock Wait Event | Description |
---|---|
advisory | Waiting to acquire an advisory user lock. |
extend | Waiting to extend a relation. |
frozenid | Waiting to update pg_database.datfrozenxid
and pg_database.datminmxid. |
object | Waiting to acquire a lock on a non-relation database object. |
page | Waiting to acquire a lock on a page of a relation. |
relation | Waiting to acquire a lock on a relation. |
spectoken | Waiting to acquire a speculative insertion lock. |
transactionid | Waiting for a transaction to finish. |
tuple | Waiting to acquire a lock on a tuple. |
userlock | Waiting to acquire a user lock. |
virtualxid | Waiting to acquire a virtual transaction ID lock. |
Table 28.12. Wait Events of Type LWLock
LWLock Wait Event | Description |
---|---|
AddinShmemInit | Waiting to manage an extension's space allocation in shared memory. |
AutoFile | Waiting to update the postgresql.auto.conf
file. |
Autovacuum | Waiting to read or update the current state of autovacuum workers. |
AutovacuumSchedule | Waiting to ensure that a table selected for autovacuum still needs vacuuming. |
BackgroundWorker | Waiting to read or update background worker state. |
BtreeVacuum | Waiting to read or update vacuum-related information for a B-tree index. |
BufferContent | Waiting to access a data page in memory. |
BufferMapping | Waiting to associate a data block with a buffer in the buffer pool. |
CheckpointerComm | Waiting to manage fsync requests. |
CommitTs | Waiting to read or update the last value set for a transaction commit timestamp. |
CommitTsBuffer | Waiting for I/O on a commit timestamp SLRU buffer. |
CommitTsSLRU | Waiting to access the commit timestamp SLRU cache. |
ControlFile | Waiting to read or update the pg_control
file or create a new WAL file. |
DynamicSharedMemoryControl | Waiting to read or update dynamic shared memory allocation information. |
LockFastPath | Waiting to read or update a process' fast-path lock information. |
LockManager | Waiting to read or update information about “heavyweight” locks. |
LogicalRepWorker | Waiting to read or update the state of logical replication workers. |
MultiXactGen | Waiting to read or update shared multixact state. |
MultiXactMemberBuffer | Waiting for I/O on a multixact member SLRU buffer. |
MultiXactMemberSLRU | Waiting to access the multixact member SLRU cache. |
MultiXactOffsetBuffer | Waiting for I/O on a multixact offset SLRU buffer. |
MultiXactOffsetSLRU | Waiting to access the multixact offset SLRU cache. |
MultiXactTruncation | Waiting to read or truncate multixact information. |
NotifyBuffer | Waiting for I/O on a NOTIFY message SLRU
buffer. |
NotifyQueue | Waiting to read or update NOTIFY messages. |
NotifyQueueTail | Waiting to update limit on NOTIFY message
storage. |
NotifySLRU | Waiting to access the NOTIFY message SLRU
cache. |
OidGen | Waiting to allocate a new OID. |
OldSnapshotTimeMap | Waiting to read or update old snapshot control information. |
ParallelAppend | Waiting to choose the next subplan during Parallel Append plan execution. |
ParallelHashJoin | Waiting to synchronize workers during Parallel Hash Join plan execution. |
ParallelQueryDSA | Waiting for parallel query dynamic shared memory allocation. |
PerSessionDSA | Waiting for parallel query dynamic shared memory allocation. |
PerSessionRecordType | Waiting to access a parallel query's information about composite types. |
PerSessionRecordTypmod | Waiting to access a parallel query's information about type modifiers that identify anonymous record types. |
PerXactPredicateList | Waiting to access the list of predicate locks held by the current serializable transaction during a parallel query. |
PredicateLockManager | Waiting to access predicate lock information used by serializable transactions. |
ProcArray | Waiting to access the shared per-process data structures (typically, to get a snapshot or report a session's transaction ID). |
RelationMapping | Waiting to read or update
a pg_filenode.map file (used to track the
filenode assignments of certain system catalogs). |
RelCacheInit | Waiting to read or update a pg_internal.init
relation cache initialization file. |
ReplicationOrigin | Waiting to create, drop or use a replication origin. |
ReplicationOriginState | Waiting to read or update the progress of one replication origin. |
ReplicationSlotAllocation | Waiting to allocate or free a replication slot. |
ReplicationSlotControl | Waiting to read or update replication slot state. |
ReplicationSlotIO | Waiting for I/O on a replication slot. |
SerialBuffer | Waiting for I/O on a serializable transaction conflict SLRU buffer. |
SerializableFinishedList | Waiting to access the list of finished serializable transactions. |
SerializablePredicateList | Waiting to access the list of predicate locks held by serializable transactions. |
SerializableXactHash | Waiting to read or update information about serializable transactions. |
SerialSLRU | Waiting to access the serializable transaction conflict SLRU cache. |
SharedTidBitmap | Waiting to access a shared TID bitmap during a parallel bitmap index scan. |
SharedTupleStore | Waiting to access a shared tuple store during parallel query. |
ShmemIndex | Waiting to find or allocate space in shared memory. |
SInvalRead | Waiting to retrieve messages from the shared catalog invalidation queue. |
SInvalWrite | Waiting to add a message to the shared catalog invalidation queue. |
SubtransBuffer | Waiting for I/O on a sub-transaction SLRU buffer. |
SubtransSLRU | Waiting to access the sub-transaction SLRU cache. |
SyncRep | Waiting to read or update information about the state of synchronous replication. |
SyncScan | Waiting to select the starting location of a synchronized table scan. |
TablespaceCreate | Waiting to create or drop a tablespace. |
TwoPhaseState | Waiting to read or update the state of prepared transactions. |
WALBufMapping | Waiting to replace a page in WAL buffers. |
WALInsert | Waiting to insert WAL data into a memory buffer. |
WALWrite | Waiting for WAL buffers to be written to disk. |
WrapLimitsVacuum | Waiting to update limits on transaction id and multixact consumption. |
XactBuffer | Waiting for I/O on a transaction status SLRU buffer. |
XactSLRU | Waiting to access the transaction status SLRU cache. |
XactTruncation | Waiting to execute pg_xact_status or update
the oldest transaction ID available to it. |
XidGen | Waiting to allocate a new transaction ID. |
Extensions can add LWLock types to the list shown in
Table 28.12. In some cases, the name
assigned by an extension will not be available in all server processes;
so an LWLock wait event might be reported as
just “extension” rather than the
extension-assigned name.
Table 28.13. Wait Events of Type Timeout
Timeout Wait Event | Description |
---|---|
BaseBackupThrottle | Waiting during base backup when throttling activity. |
CheckpointWriteDelay | Waiting between writes while performing a checkpoint. |
PgSleep | Waiting due to a call to pg_sleep or
a sibling function. |
RecoveryApplyDelay | Waiting to apply WAL during recovery because of a delay setting. |
RecoveryRetrieveRetryInterval | Waiting during recovery when WAL data is not available from any
source (pg_wal , archive or stream). |
RegisterSyncRequest | Waiting while sending synchronization requests to the checkpointer, because the request queue is full. |
VacuumDelay | Waiting in a cost-based vacuum delay point. |
Here is an example of how wait events can be viewed:
SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event is NOT NULL;
 pid  | wait_event_type | wait_event
------+-----------------+------------
 2540 | Lock            | relation
 6644 | LWLock          | ProcArray
(2 rows)
pg_stat_replication
The pg_stat_replication
view will contain one row
per WAL sender process, showing statistics about replication to that
sender's connected standby server. Only directly connected standbys are
listed; no information is available about downstream standby servers.
Table 28.14. pg_stat_replication
View
Column Type Description |
---|
Process ID of a WAL sender process |
OID of the user logged into this WAL sender process |
Name of the user logged into this WAL sender process |
Name of the application that is connected to this WAL sender |
IP address of the client connected to this WAL sender. If this field is null, it indicates that the client is connected via a Unix socket on the server machine. |
Host name of the connected client, as reported by a
reverse DNS lookup of |
TCP port number that the client is using for communication
with this WAL sender, or |
Time when this process was started, i.e., when the client connected to this WAL sender |
This standby's |
Current WAL sender state. Possible values are:
|
Last write-ahead log location sent on this connection |
Last write-ahead log location written to disk by this standby server |
Last write-ahead log location flushed to disk by this standby server |
Last write-ahead log location replayed into the database on this standby server |
Time elapsed between flushing recent WAL locally and receiving
notification that this standby server has written it (but not yet
flushed it or applied it). This can be used to gauge the delay that
|
Time elapsed between flushing recent WAL locally and receiving
notification that this standby server has written and flushed it
(but not yet applied it). This can be used to gauge the delay that
|
Time elapsed between flushing recent WAL locally and receiving
notification that this standby server has written, flushed and
applied it. This can be used to gauge the delay that
|
Priority of this standby server for being chosen as the synchronous standby in a priority-based synchronous replication. This has no effect in a quorum-based synchronous replication. |
Synchronous state of this standby server. Possible values are:
|
Send time of last reply message received from standby server |
The lag times reported in the pg_stat_replication
view are measurements of the time taken for recent WAL to be written,
flushed and replayed and for the sender to know about it. These times
represent the commit delay that was (or would have been) introduced by each
synchronous commit level, if the remote server was configured as a
synchronous standby. For an asynchronous standby, the
replay_lag
column approximates the delay
before recent transactions became visible to queries. If the standby
server has entirely caught up with the sending server and there is no more
WAL activity, the most recently measured lag times will continue to be
displayed for a short time and then show NULL.
Lag times work automatically for physical replication. Logical decoding plugins may optionally emit tracking messages; if they do not, the tracking mechanism will simply display NULL lag.
The reported lag times are not predictions of how long it will take for
the standby to catch up with the sending server assuming the current
rate of replay. Such a system would show similar times while new WAL is
being generated, but would differ when the sender becomes idle. In
particular, when the standby has caught up completely,
pg_stat_replication
shows the time taken to
write, flush and replay the most recent reported WAL location rather than
zero as some users might expect. This is consistent with the goal of
measuring synchronous commit and transaction visibility delays for
recent write transactions.
To reduce confusion for users expecting a different model of lag, the
lag columns revert to NULL after a short time on a fully replayed idle
system. Monitoring systems should choose whether to represent this
as missing data, zero or continue to display the last known value.
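For example, the lag columns can be viewed together with each standby's identity and state (an illustrative query):
SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag
    FROM pg_stat_replication;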
pg_stat_replication_slots
The pg_stat_replication_slots
view will contain
one row per logical replication slot, showing statistics about its usage.
Table 28.15. pg_stat_replication_slots
View
Column Type Description |
---|
A unique, cluster-wide identifier for the replication slot |
Number of transactions spilled to disk once the memory used by
logical decoding to decode changes from WAL has exceeded
|
Number of times transactions were spilled to disk while decoding changes from WAL for this slot. This counter is incremented each time a transaction is spilled, and the same transaction may be spilled multiple times. |
Amount of decoded transaction data spilled to disk while performing
decoding of changes from WAL for this slot. This and other spill
counters can be used to gauge the I/O which occurred during logical
decoding and allow tuning |
Number of in-progress transactions streamed to the decoding output
plugin after the memory used by logical decoding to decode changes
from WAL for this slot has exceeded
|
Number of times in-progress transactions were streamed to the decoding output plugin while decoding changes from WAL for this slot. This counter is incremented each time a transaction is streamed, and the same transaction may be streamed multiple times. |
Amount of transaction data decoded for streaming in-progress
transactions to the decoding output plugin while decoding changes from
WAL for this slot. This and other streaming counters for this slot can
be used to tune |
Number of decoded transactions sent to the decoding output plugin for this slot. This counts top-level transactions only, and is not incremented for subtransactions. Note that this includes the transactions that are streamed and/or spilled. |
Amount of transaction data decoded for sending transactions to the decoding output plugin while decoding changes from WAL for this slot. Note that this includes data that is streamed and/or spilled. |
Time at which these statistics were last reset |
pg_stat_wal_receiver
The pg_stat_wal_receiver
view will contain only
one row, showing statistics about the WAL receiver from that receiver's
connected server.
Table 28.16. pg_stat_wal_receiver
View
Column Type Description |
---|
Process ID of the WAL receiver process |
Activity status of the WAL receiver process |
First write-ahead log location used when WAL receiver is started |
First timeline number used when WAL receiver is started |
Last write-ahead log location already received and written to disk, but not flushed. This should not be used for data integrity checks. |
Last write-ahead log location already received and flushed to disk, the initial value of this field being the first log location used when WAL receiver is started |
Timeline number of last write-ahead log location received and flushed to disk, the initial value of this field being the timeline number of the first log location used when WAL receiver is started |
Send time of last message received from origin WAL sender |
Receipt time of last message received from origin WAL sender |
Last write-ahead log location reported to origin WAL sender |
Time of last write-ahead log location reported to origin WAL sender |
Replication slot name used by this WAL receiver |
Host of the PostgreSQL instance
this WAL receiver is connected to. This can be a host name,
an IP address, or a directory path if the connection is via
Unix socket. (The path case can be distinguished because it
will always be an absolute path, beginning with |
Port number of the PostgreSQL instance this WAL receiver is connected to. |
Connection string used by this WAL receiver, with security-sensitive fields obfuscated. |
pg_stat_subscription
The pg_stat_subscription
view will contain one
row per subscription for the main worker (with null PID if the worker is
not running), and additional rows for workers handling the initial data
copy of the subscribed tables.
Table 28.17. pg_stat_subscription
View
Column Type Description |
---|
OID of the subscription |
Name of the subscription |
Process ID of the subscription worker process |
OID of the relation that the worker is synchronizing; null for the main apply worker |
Last write-ahead log location received, the initial value of this field being 0 |
Send time of last message received from origin WAL sender |
Receipt time of last message received from origin WAL sender |
Last write-ahead log location reported to origin WAL sender |
Time of last write-ahead log location reported to origin WAL sender |
pg_stat_ssl
The pg_stat_ssl
view will contain one row per
backend or WAL sender process, showing statistics about SSL usage on
this connection. It can be joined to pg_stat_activity
or pg_stat_replication
on the
pid
column to get more details about the
connection.
Table 28.18. pg_stat_ssl
View
Column Type Description |
---|
Process ID of a backend or WAL sender process |
True if SSL is used on this connection |
Version of SSL in use, or NULL if SSL is not in use on this connection |
Name of SSL cipher in use, or NULL if SSL is not in use on this connection |
Number of bits in the encryption algorithm used, or NULL if SSL is not used on this connection |
Distinguished Name (DN) field from the client certificate
used, or NULL if no client certificate was supplied or if SSL
is not in use on this connection. This field is truncated if the
DN field is longer than |
Serial number of the client certificate, or NULL if no client certificate was supplied or if SSL is not in use on this connection. The combination of certificate serial number and certificate issuer uniquely identifies a certificate (unless the issuer erroneously reuses serial numbers). |
DN of the issuer of the client certificate, or NULL if no client
certificate was supplied or if SSL is not in use on this connection.
This field is truncated like |
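For example, the SSL status of each connection can be shown alongside session details by joining on pid, as described above (an illustrative query):
SELECT a.pid, a.usename, a.client_addr, s.ssl, s.version, s.cipher
    FROM pg_stat_activity a
    JOIN pg_stat_ssl s USING (pid);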
pg_stat_gssapi
The pg_stat_gssapi
view will contain one row per
backend, showing information about GSSAPI usage on this connection. It can
be joined to pg_stat_activity
or
pg_stat_replication
on the
pid
column to get more details about the
connection.
Table 28.19. pg_stat_gssapi
View
Column Type Description |
---|
Process ID of a backend |
True if GSSAPI authentication was used for this connection |
Principal used to authenticate this connection, or NULL
if GSSAPI was not used to authenticate this connection. This
field is truncated if the principal is longer than
|
True if GSSAPI encryption is in use on this connection |
pg_stat_archiver
The pg_stat_archiver
view will always have a
single row, containing data about the archiver process of the cluster.
Table 28.20. pg_stat_archiver
View
Column Type Description |
---|
Number of WAL files that have been successfully archived |
Name of the last WAL file successfully archived |
Time of the last successful archive operation |
Number of failed attempts for archiving WAL files |
Name of the WAL file of the last failed archival operation |
Time of the last failed archival operation |
Time at which these statistics were last reset |
pg_stat_bgwriter
The pg_stat_bgwriter
view will always have a
single row, containing global data for the cluster.
Table 28.21. pg_stat_bgwriter
View
Column Type Description |
---|
Number of scheduled checkpoints that have been performed |
Number of requested checkpoints that have been performed |
Total amount of time that has been spent in the portion of checkpoint processing where files are written to disk, in milliseconds |
Total amount of time that has been spent in the portion of checkpoint processing where files are synchronized to disk, in milliseconds |
Number of buffers written during checkpoints |
Number of buffers written by the background writer |
Number of times the background writer stopped a cleaning scan because it had written too many buffers |
Number of buffers written directly by a backend |
Number of times a backend had to execute its own
|
Number of buffers allocated |
Time at which these statistics were last reset |
pg_stat_wal
The pg_stat_wal
view will always have a
single row, containing data about WAL activity of the cluster.
Table 28.22. pg_stat_wal
View
Column Type Description |
---|
Total number of WAL records generated |
Total number of WAL full page images generated |
Total amount of WAL generated in bytes |
Number of times WAL data was written to disk because WAL buffers became full |
Number of times WAL buffers were written out to disk via
|
Number of times WAL files were synced to disk via
|
Total amount of time spent writing WAL buffers to disk via
|
Total amount of time spent syncing WAL files to disk via
|
Time at which these statistics were last reset |
pg_stat_database
The pg_stat_database
view will contain one row
for each database in the cluster, plus one for shared objects, showing
database-wide statistics.
Table 28.23. pg_stat_database
View
Column Type Description |
---|
OID of this database, or 0 for objects belonging to a shared relation |
Name of this database, or |
Number of backends currently connected to this database, or
|
Number of transactions in this database that have been committed |
Number of transactions in this database that have been rolled back |
Number of disk blocks read in this database |
Number of times disk blocks were found already in the buffer cache, so that a read was not necessary (this only includes hits in the PostgreSQL buffer cache, not the operating system's file system cache) |
Number of rows returned by queries in this database |
Number of rows fetched by queries in this database |
Number of rows inserted by queries in this database |
Number of rows updated by queries in this database |
Number of rows deleted by queries in this database |
Number of queries canceled due to conflicts with recovery
in this database. (Conflicts occur only on standby servers; see
|
Number of temporary files created by queries in this database. All temporary files are counted, regardless of why the temporary file was created (e.g., sorting or hashing), and regardless of the log_temp_files setting. |
Total amount of data written to temporary files by queries in this database. All temporary files are counted, regardless of why the temporary file was created, and regardless of the log_temp_files setting. |
Number of deadlocks detected in this database |
Number of data page checksum failures detected in this database (or on a shared object), or NULL if data checksums are not enabled. |
Time at which the last data page checksum failure was detected in this database (or on a shared object), or NULL if data checksums are not enabled. |
Time spent reading data file blocks by backends in this database, in milliseconds (if track_io_timing is enabled, otherwise zero) |
Time spent writing data file blocks by backends in this database, in milliseconds (if track_io_timing is enabled, otherwise zero) |
Time spent by database sessions in this database, in milliseconds (note that statistics are only updated when the state of a session changes, so if sessions have been idle for a long time, this idle time won't be included) |
Time spent executing SQL statements in this database, in milliseconds
(this corresponds to the states |
Time spent idling while in a transaction in this database, in milliseconds
(this corresponds to the states |
Total number of sessions established to this database |
Number of database sessions to this database that were terminated because connection to the client was lost |
Number of database sessions to this database that were terminated by fatal errors |
Number of database sessions to this database that were terminated by operator intervention |
Time at which these statistics were last reset |
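For example, transaction counts and temporary-file usage can be compared across databases (an illustrative query using the view's documented columns):
SELECT datname, xact_commit, xact_rollback, deadlocks,
       temp_files, pg_size_pretty(temp_bytes) AS temp_written
    FROM pg_stat_database
    WHERE datname IS NOT NULL
    ORDER BY xact_commit DESC;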
pg_stat_database_conflicts
The pg_stat_database_conflicts
view will contain
one row per database, showing database-wide statistics about
query cancels occurring due to conflicts with recovery on standby servers.
This view will only contain information on standby servers, since
conflicts do not occur on primary servers.
Table 28.24. pg_stat_database_conflicts
View
Column Type Description |
---|
OID of a database |
Name of this database |
Number of queries in this database that have been canceled due to dropped tablespaces |
Number of queries in this database that have been canceled due to lock timeouts |
Number of queries in this database that have been canceled due to old snapshots |
Number of queries in this database that have been canceled due to pinned buffers |
Number of queries in this database that have been canceled due to deadlocks |
pg_stat_all_tables
The pg_stat_all_tables
view will contain
one row for each table in the current database (including TOAST
tables), showing statistics about accesses to that specific table. The
pg_stat_user_tables
and
pg_stat_sys_tables
views
contain the same information,
but filtered to only show user and system tables respectively.
Table 28.25. pg_stat_all_tables
View
Column Type Description |
---|
OID of a table |
Name of the schema that this table is in |
Name of this table |
Number of sequential scans initiated on this table |
Number of live rows fetched by sequential scans |
Number of index scans initiated on this table |
Number of live rows fetched by index scans |
Number of rows inserted |
Number of rows updated (includes HOT updated rows) |
Number of rows deleted |
Number of rows HOT updated (i.e., with no separate index update required) |
Estimated number of live rows |
Estimated number of dead rows |
Estimated number of rows modified since this table was last analyzed |
Estimated number of rows inserted since this table was last vacuumed |
Last time at which this table was manually vacuumed
(not counting |
Last time at which this table was vacuumed by the autovacuum daemon |
Last time at which this table was manually analyzed |
Last time at which this table was analyzed by the autovacuum daemon |
Number of times this table has been manually vacuumed
(not counting |
Number of times this table has been vacuumed by the autovacuum daemon |
Number of times this table has been manually analyzed |
Number of times this table has been analyzed by the autovacuum daemon |
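For example, tables carrying many dead rows, along with their most recent vacuum times, can be listed like this (an illustrative query):
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;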
pg_stat_all_indexes
The pg_stat_all_indexes
view will contain
one row for each index in the current database,
showing statistics about accesses to that specific index. The
pg_stat_user_indexes
and
pg_stat_sys_indexes
views
contain the same information,
but filtered to only show user and system indexes respectively.
Table 28.26. pg_stat_all_indexes
View
Column Type Description |
---|
OID of the table for this index |
OID of this index |
Name of the schema this index is in |
Name of the table for this index |
Name of this index |
Number of index scans initiated on this index |
Number of index entries returned by scans on this index |
Number of live table rows fetched by simple index scans using this index |
Indexes can be used by simple index scans, “bitmap” index scans,
and the optimizer. In a bitmap scan
the output of several indexes can be combined via AND or OR rules,
so it is difficult to associate individual heap row fetches
with specific indexes when a bitmap scan is used. Therefore, a bitmap
scan increments the pg_stat_all_indexes.idx_tup_read
count(s) for the index(es) it uses, and it increments the
pg_stat_all_tables.idx_tup_fetch
count for the table, but it does not affect
pg_stat_all_indexes.idx_tup_fetch.
The optimizer also accesses indexes to check for supplied constants
whose values are outside the recorded range of the optimizer statistics
because the optimizer statistics might be stale.
The idx_tup_read
and idx_tup_fetch
counts
can be different even without any use of bitmap scans,
because idx_tup_read
counts
index entries retrieved from the index while idx_tup_fetch
counts live rows fetched from the table. The latter will be less if any
dead or not-yet-committed rows are fetched using the index, or if any
heap fetches are avoided by means of an index-only scan.
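For example, user indexes that have never been scanned (possible candidates for review) can be listed as follows (illustrative only):
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY schemaname, relname;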
pg_statio_all_tables
The pg_statio_all_tables
view will contain
one row for each table in the current database (including TOAST
tables), showing statistics about I/O on that specific table. The
pg_statio_user_tables
and
pg_statio_sys_tables
views
contain the same information,
but filtered to only show user and system tables respectively.
Table 28.27. pg_statio_all_tables
View
Column Type Description |
---|
OID of a table |
Name of the schema that this table is in |
Name of this table |
Number of disk blocks read from this table |
Number of buffer hits in this table |
Number of disk blocks read from all indexes on this table |
Number of buffer hits in all indexes on this table |
Number of disk blocks read from this table's TOAST table (if any) |
Number of buffer hits in this table's TOAST table (if any) |
Number of disk blocks read from this table's TOAST table indexes (if any) |
Number of buffer hits in this table's TOAST table indexes (if any) |
pg_statio_all_indexes
The pg_statio_all_indexes
view will contain
one row for each index in the current database,
showing statistics about I/O on that specific index. The
pg_statio_user_indexes
and
pg_statio_sys_indexes
views
contain the same information,
but filtered to only show user and system indexes respectively.
Table 28.28. pg_statio_all_indexes
View
Column Type Description |
---|
OID of the table for this index |
OID of this index |
Name of the schema this index is in |
Name of the table for this index |
Name of this index |
Number of disk blocks read from this index |
Number of buffer hits in this index |
pg_statio_all_sequences
The pg_statio_all_sequences
view will contain
one row for each sequence in the current database,
showing statistics about I/O on that specific sequence.
Table 28.29. pg_statio_all_sequences
View
Column Type Description |
---|
OID of a sequence |
Name of the schema this sequence is in |
Name of this sequence |
Number of disk blocks read from this sequence |
Number of buffer hits in this sequence |
pg_stat_user_functions
The pg_stat_user_functions
view will contain
one row for each tracked function, showing statistics about executions of
that function. The track_functions parameter
controls exactly which functions are tracked.
Table 28.30. pg_stat_user_functions
View
Column Type Description |
---|
OID of a function |
Name of the schema this function is in |
Name of this function |
Number of times this function has been called |
Total time spent in this function and all other functions called by it, in milliseconds |
Total time spent in this function itself, not including other functions called by it, in milliseconds |
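For example, when track_functions is enabled, the most expensive tracked functions can be listed like this (illustrative only):
SELECT schemaname, funcname, calls, total_time, self_time
    FROM pg_stat_user_functions
    ORDER BY total_time DESC
    LIMIT 10;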
pg_stat_slru
PostgreSQL accesses certain on-disk information
via SLRU (simple least-recently-used) caches.
The pg_stat_slru
view will contain
one row for each tracked SLRU cache, showing statistics about access
to cached pages.
Table 28.31. pg_stat_slru
View
Column Type Description |
---|
Name of the SLRU |
Number of blocks zeroed during initializations |
Number of times disk blocks were found already in the SLRU, so that a read was not necessary (this only includes hits in the SLRU, not the operating system's file system cache) |
Number of disk blocks read for this SLRU |
Number of disk blocks written for this SLRU |
Number of blocks checked for existence for this SLRU |
Number of flushes of dirty data for this SLRU |
Number of truncates for this SLRU |
Time at which these statistics were last reset |
Other ways of looking at the statistics can be set up by writing
queries that use the same underlying statistics access functions used by
the standard views shown above. For details such as the functions' names,
consult the definitions of the standard views. (For example, in
psql you could issue \d+ pg_stat_activity.)
The access functions for per-database statistics take a database OID as an
argument to identify which database to report on.
The per-table and per-index functions take a table or index OID.
The functions for per-function statistics take a function OID.
Note that only tables, indexes, and functions in the current database
can be seen with these functions.
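For example, the per-database function underlying the xact_commit column of pg_stat_database can be called directly with a database OID (a minimal sketch; the exact function used by a view can be confirmed from that view's definition):
SELECT pg_stat_get_db_xact_commit(oid) AS commits
    FROM pg_database
    WHERE datname = current_database();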
Additional functions related to statistics collection are listed in Table 28.32.
Table 28.32. Additional Statistics Functions
Using pg_stat_reset() also resets counters that
autovacuum uses to determine when to trigger a vacuum or an analyze.
Resetting these counters can cause autovacuum to skip necessary
work, which can lead to problems such as table bloat or outdated
table statistics. A database-wide ANALYZE is
recommended after the statistics have been reset.
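A minimal sketch of the recommended sequence:
SELECT pg_stat_reset();   -- discard all statistics counters for the current database
ANALYZE;                  -- re-analyze all tables, as recommended after a reset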
pg_stat_get_activity, the underlying function of
the pg_stat_activity view, returns a set of records
containing all the available information about each backend process.
Sometimes it may be more convenient to obtain just a subset of this
information. In such cases, an older set of per-backend statistics
access functions can be used; these are shown in Table 28.33.
These access functions use a backend ID number, which ranges from one
to the number of currently active backends.
The function pg_stat_get_backend_idset
provides a
convenient way to generate one row for each active backend for
invoking these functions. For example, to show the PIDs and
current queries of all backends:
SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
       pg_stat_get_backend_activity(s.backendid) AS query
    FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
Table 28.33. Per-Backend Statistics Functions
Function Description |
---|
Returns the set of currently active backend ID numbers (from 1 to the number of active backends). |
Returns the text of this backend's most recent query. |
Returns the time when the backend's most recent query was started. |
Returns the IP address of the client connected to this backend. |
Returns the TCP port number that the client is using for communication. |
Returns the OID of the database this backend is connected to. |
Returns the process ID of this backend. |
Returns the time when this process was started. |
Returns the OID of the user logged into this backend. |
Returns the wait event type name if this backend is currently waiting, otherwise NULL. See Table 28.4 for details. |
Returns the wait event name if this backend is currently waiting, otherwise NULL. See Table 28.5 through Table 28.13. |
Returns the time when the backend's current transaction was started. |
Another useful tool for monitoring database activity is the
pg_locks
system table. It allows the
database administrator to view information about the outstanding
locks in the lock manager. For example, this capability can be used
to:
View all the locks currently outstanding, all the locks on relations in a particular database, all the locks on a particular relation, or all the locks held by a particular PostgreSQL session.
Determine the relation in the current database with the most ungranted locks (which might be a source of contention among database clients).
Determine the effect of lock contention on overall database performance, as well as the extent to which contention varies with overall database traffic.
Details of the pg_locks
view appear in
Section 52.74.
For more information on locking and managing concurrency with
PostgreSQL, refer to Chapter 13.
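For example, ungranted lock requests can be shown together with the waiting session's query by joining pg_locks to pg_stat_activity (an illustrative query):
SELECT l.locktype, l.relation::regclass AS relation, l.mode, l.pid, a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE NOT l.granted;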
PostgreSQL has the ability to report the progress of
certain commands during command execution. Currently, the only commands
which support progress reporting are ANALYZE, CLUSTER,
CREATE INDEX, VACUUM, COPY,
and BASE_BACKUP (i.e., the replication
command that pg_basebackup issues to take
a base backup).
This may be expanded in the future.
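For example, the progress of an index build could be watched from another session with a query such as the following (illustrative; in psql it can be repeated with \watch):
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
    FROM pg_stat_progress_create_index;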
Whenever ANALYZE
is running, the
pg_stat_progress_analyze
view will contain a
row for each backend that is currently running that command. The tables
below describe the information that will be reported and provide
information about how to interpret it.
Table 28.34. pg_stat_progress_analyze
View
Column Type | Description |
---|---|
pid integer | Process ID of backend. |
datid oid | OID of the database to which this backend is connected. |
datname name | Name of the database to which this backend is connected. |
relid oid | OID of the table being analyzed. |
phase text | Current processing phase. See Table 28.35. |
sample_blks_total bigint | Total number of heap blocks that will be sampled. |
sample_blks_scanned bigint | Number of heap blocks scanned. |
ext_stats_total bigint | Number of extended statistics. |
ext_stats_computed bigint | Number of extended statistics computed. This counter only advances when the phase is computing extended statistics. |
child_tables_total bigint | Number of child tables. |
child_tables_done bigint | Number of child tables scanned. This counter only advances when the phase is acquiring inherited sample rows. |
current_child_table_relid oid | OID of the child table currently being scanned. This field is only valid when the phase is acquiring inherited sample rows. |
Table 28.35. ANALYZE Phases
Phase | Description |
---|---|
initializing | The command is preparing to begin scanning the heap. This phase is expected to be very brief. |
acquiring sample rows |
The command is currently scanning the table given by
relid to obtain sample rows.
|
acquiring inherited sample rows |
The command is currently scanning child tables to obtain sample rows.
Columns child_tables_total ,
child_tables_done , and
current_child_table_relid contain the
progress information for this phase.
|
computing statistics | The command is computing statistics from the sample rows obtained during the table scan. |
computing extended statistics | The command is computing extended statistics from the sample rows obtained during the table scan. |
finalizing analyze |
The command is updating pg_class . When this
phase is completed, ANALYZE will end.
|
Note that when ANALYZE
is run on a partitioned table,
all of its partitions are also recursively analyzed.
In that case, ANALYZE
progress is reported first for the parent table, whereby its inheritance
statistics are collected, followed by that for each partition.
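As a quick way to watch a running ANALYZE interactively, a query such as the following can be used. This is a sketch only, assuming the column names listed in Table 28.34:
SELECT p.pid, p.relid::regclass AS relation, p.phase,
       round(100.0 * p.sample_blks_scanned / NULLIF(p.sample_blks_total, 0), 1) AS "% sampled"
  FROM pg_stat_progress_analyze p;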
Whenever CREATE INDEX
or REINDEX
is running, the
pg_stat_progress_create_index
view will contain
one row for each backend that is currently creating indexes. The tables
below describe the information that will be reported and provide information
about how to interpret it.
Table 28.36. pg_stat_progress_create_index
View
Column Type | Description |
---|---|
pid integer | Process ID of backend. |
datid oid | OID of the database to which this backend is connected. |
datname name | Name of the database to which this backend is connected. |
relid oid | OID of the table on which the index is being created. |
index_relid oid | OID of the index being created or reindexed. During a non-concurrent CREATE INDEX, this is 0. |
command text | The command that is running: CREATE INDEX, CREATE INDEX CONCURRENTLY, REINDEX, or REINDEX CONCURRENTLY. |
phase text | Current processing phase of index creation. See Table 28.37. |
lockers_total bigint | Total number of lockers to wait for, when applicable. |
lockers_done bigint | Number of lockers already waited for. |
current_locker_pid bigint | Process ID of the locker currently being waited for. |
blocks_total bigint | Total number of blocks to be processed in the current phase. |
blocks_done bigint | Number of blocks already processed in the current phase. |
tuples_total bigint | Total number of tuples to be processed in the current phase. |
tuples_done bigint | Number of tuples already processed in the current phase. |
partitions_total bigint | When creating an index on a partitioned table, this column is set to the total number of partitions on which the index is to be created. This field is 0 during a REINDEX. |
partitions_done bigint | When creating an index on a partitioned table, this column is set to the number of partitions on which the index has been created. This field is 0 during a REINDEX. |
Table 28.37. CREATE INDEX Phases
Phase | Description |
---|---|
initializing |
CREATE INDEX or REINDEX is preparing to create the index. This
phase is expected to be very brief.
|
waiting for writers before build |
CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY is waiting for transactions
with write locks that can potentially see the table to finish.
This phase is skipped when not in concurrent mode.
Columns lockers_total , lockers_done
and current_locker_pid contain the progress
information for this phase.
|
building index |
The index is being built by the access method-specific code. In this phase,
access methods that support progress reporting fill in their own progress data,
and the subphase is indicated in this column. Typically,
blocks_total and blocks_done
will contain progress data, as well as potentially
tuples_total and tuples_done .
|
waiting for writers before validation |
CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY is waiting for transactions
with write locks that can potentially write into the table to finish.
This phase is skipped when not in concurrent mode.
Columns lockers_total , lockers_done
and current_locker_pid contain the progress
information for this phase.
|
index validation: scanning index |
CREATE INDEX CONCURRENTLY is scanning the index searching
for tuples that need to be validated.
This phase is skipped when not in concurrent mode.
Columns blocks_total (set to the total size of the index)
and blocks_done contain the progress information for this phase.
|
index validation: sorting tuples |
CREATE INDEX CONCURRENTLY is sorting the output of the
index scanning phase.
|
index validation: scanning table |
CREATE INDEX CONCURRENTLY is scanning the table
to validate the index tuples collected in the previous two phases.
This phase is skipped when not in concurrent mode.
Columns blocks_total (set to the total size of the table)
and blocks_done contain the progress information for this phase.
|
waiting for old snapshots |
CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY is waiting for transactions
that can potentially see the table to release their snapshots. This
phase is skipped when not in concurrent mode.
Columns lockers_total , lockers_done
and current_locker_pid contain the progress
information for this phase.
|
waiting for readers before marking dead |
REINDEX CONCURRENTLY is waiting for transactions
with read locks on the table to finish, before marking the old index dead.
This phase is skipped when not in concurrent mode.
Columns lockers_total , lockers_done
and current_locker_pid contain the progress
information for this phase.
|
waiting for readers before dropping |
REINDEX CONCURRENTLY is waiting for transactions
with read locks on the table to finish, before dropping the old index.
This phase is skipped when not in concurrent mode.
Columns lockers_total , lockers_done
and current_locker_pid contain the progress
information for this phase.
|
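For example, to see how far each index build has progressed within its current phase, a query like the following can be used. This is a sketch only, assuming the column names listed in Table 28.36:
SELECT p.pid, p.phase,
       round(100.0 * p.blocks_done / NULLIF(p.blocks_total, 0), 1) AS "% of phase",
       a.query
  FROM pg_stat_progress_create_index p
  JOIN pg_stat_activity a USING (pid);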
Whenever VACUUM
is running, the
pg_stat_progress_vacuum
view will contain
one row for each backend (including autovacuum worker processes) that is
currently vacuuming. The tables below describe the information
that will be reported and provide information about how to interpret it.
Progress for VACUUM FULL
commands is reported via
pg_stat_progress_cluster
because both VACUUM FULL
and CLUSTER
rewrite the table, while regular VACUUM
only modifies it
in place. See Section 28.4.4.
Table 28.38. pg_stat_progress_vacuum
View
Column Type | Description |
---|---|
pid integer | Process ID of backend. |
datid oid | OID of the database to which this backend is connected. |
datname name | Name of the database to which this backend is connected. |
relid oid | OID of the table being vacuumed. |
phase text | Current processing phase of vacuum. See Table 28.39. |
heap_blks_total bigint | Total number of heap blocks in the table. This number is reported as of the beginning of the scan; blocks added later will not be (and need not be) visited by this VACUUM. |
heap_blks_scanned bigint | Number of heap blocks scanned. Because the visibility map is used to optimize scans, some blocks will be skipped without inspection; skipped blocks are included in this total, so that this number will eventually become equal to heap_blks_total when the vacuum is complete. |
heap_blks_vacuumed bigint | Number of heap blocks vacuumed. Unless the table has no indexes, this counter only advances when the phase is vacuuming heap. |
index_vacuum_count bigint | Number of completed index vacuum cycles. |
max_dead_tuples bigint | Number of dead tuples that we can store before needing to perform an index vacuum cycle, based on maintenance_work_mem. |
num_dead_tuples bigint | Number of dead tuples collected since the last index vacuum cycle. |
Table 28.39. VACUUM Phases
Phase | Description |
---|---|
initializing |
VACUUM is preparing to begin scanning the heap. This
phase is expected to be very brief.
|
scanning heap |
VACUUM is currently scanning the heap. It will prune and
defragment each page if required, and possibly perform freezing
activity. The heap_blks_scanned column can be used
to monitor the progress of the scan.
|
vacuuming indexes |
VACUUM is currently vacuuming the indexes. If a table has
any indexes, this will happen at least once per vacuum, after the heap
has been completely scanned. It may happen multiple times per vacuum
if maintenance_work_mem (or, in the case of autovacuum,
autovacuum_work_mem if set) is insufficient to store
the number of dead tuples found.
|
vacuuming heap |
VACUUM is currently vacuuming the heap. Vacuuming the heap
is distinct from scanning the heap, and occurs after each instance of
vacuuming indexes. If heap_blks_scanned is less than
heap_blks_total , the system will return to scanning
the heap after this phase is completed; otherwise, it will begin
cleaning up indexes after this phase is completed.
|
cleaning up indexes |
VACUUM is currently cleaning up indexes. This occurs after
the heap has been completely scanned and all vacuuming of the indexes
and the heap has been completed.
|
truncating heap |
VACUUM is currently truncating the heap so as to return
empty pages at the end of the relation to the operating system. This
occurs after cleaning up indexes.
|
performing final cleanup |
VACUUM is performing final cleanup. During this phase,
VACUUM will vacuum the free space map, update statistics
in pg_class , and report statistics to the statistics
collector. When this phase is completed, VACUUM will end.
|
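For example, the fraction of the heap already scanned by each running vacuum can be computed like this. This is a sketch only, assuming the column names listed in Table 28.38:
SELECT p.pid, p.relid::regclass AS relation, p.phase,
       round(100.0 * p.heap_blks_scanned / NULLIF(p.heap_blks_total, 0), 1) AS "% scanned"
  FROM pg_stat_progress_vacuum p;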
Whenever CLUSTER
or VACUUM FULL
is
running, the pg_stat_progress_cluster
view will
contain a row for each backend that is currently running either command.
The tables below describe the information that will be reported and
provide information about how to interpret it.
Table 28.40. pg_stat_progress_cluster
View
Column Type | Description |
---|---|
pid integer | Process ID of backend. |
datid oid | OID of the database to which this backend is connected. |
datname name | Name of the database to which this backend is connected. |
relid oid | OID of the table being clustered. |
command text | The command that is running. Either CLUSTER or VACUUM FULL. |
phase text | Current processing phase. See Table 28.41. |
cluster_index_relid oid | If the table is being scanned using an index, this is the OID of the index being used; otherwise, it is zero. |
heap_tuples_scanned bigint | Number of heap tuples scanned. This counter only advances when the phase is seq scanning heap, index scanning heap or writing new heap. |
heap_tuples_written bigint | Number of heap tuples written. This counter only advances when the phase is seq scanning heap, index scanning heap or writing new heap. |
heap_blks_total bigint | Total number of heap blocks in the table. This number is reported as of the beginning of seq scanning heap. |
heap_blks_scanned bigint | Number of heap blocks scanned. This counter only advances when the phase is seq scanning heap. |
index_rebuild_count bigint | Number of indexes rebuilt. This counter only advances when the phase is rebuilding index. |
Table 28.41. CLUSTER and VACUUM FULL Phases
Phase | Description |
---|---|
initializing | The command is preparing to begin scanning the heap. This phase is expected to be very brief. |
seq scanning heap | The command is currently scanning the table using a sequential scan. |
index scanning heap |
CLUSTER is currently scanning the table using an index scan.
|
sorting tuples |
CLUSTER is currently sorting tuples.
|
writing new heap |
CLUSTER is currently writing the new heap.
|
swapping relation files | The command is currently swapping newly-built files into place. |
rebuilding index | The command is currently rebuilding an index. |
performing final cleanup |
The command is performing final cleanup. When this phase is
completed, CLUSTER
or VACUUM FULL will end.
|
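A simple way to follow a running CLUSTER or VACUUM FULL is to watch the block and index counters. This is a sketch only, assuming the column names listed in Table 28.40:
SELECT pid, command, phase,
       heap_blks_scanned, heap_blks_total, index_rebuild_count
  FROM pg_stat_progress_cluster;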
Whenever an application like pg_basebackup
is taking a base backup, the
pg_stat_progress_basebackup
view will contain a row for each WAL sender process that is currently
running the BASE_BACKUP
replication command
and streaming the backup. The tables below describe the information
that will be reported and provide information about how to interpret it.
Table 28.42. pg_stat_progress_basebackup
View
Column Type | Description |
---|---|
pid integer | Process ID of a WAL sender process. |
phase text | Current processing phase. See Table 28.43. |
backup_total bigint | Total amount of data that will be streamed. This is estimated and reported as of the beginning of the streaming database files phase. |
backup_streamed bigint | Amount of data streamed. This counter only advances when the phase is streaming database files or transferring wal files. |
tablespaces_total bigint | Total number of tablespaces that will be streamed. |
tablespaces_streamed bigint | Number of tablespaces streamed. This counter only advances when the phase is streaming database files. |
Table 28.43. Base Backup Phases
Phase | Description |
---|---|
initializing | The WAL sender process is preparing to begin the backup. This phase is expected to be very brief. |
waiting for checkpoint to finish |
The WAL sender process is currently performing
pg_start_backup to prepare to
take a base backup, and waiting for the start-of-backup
checkpoint to finish.
|
estimating backup size | The WAL sender process is currently estimating the total amount of database files that will be streamed as a base backup. |
streaming database files | The WAL sender process is currently streaming database files as a base backup. |
waiting for wal archiving to finish |
The WAL sender process is currently performing
pg_stop_backup to finish the backup,
and waiting for all the WAL files required for the base backup
to be successfully archived.
If either --wal-method=none or
--wal-method=stream is specified in
pg_basebackup, the backup will end
when this phase is completed.
|
transferring wal files |
The WAL sender process is currently transferring all WAL logs
generated during the backup. This phase occurs after
waiting for wal archiving to finish phase if
--wal-method=fetch is specified in
pg_basebackup. The backup will end
when this phase is completed.
|
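For example, the percentage of the estimated backup size already streamed can be computed as follows. This is a sketch only, assuming the column names listed in Table 28.42:
SELECT pid, phase,
       round(100.0 * backup_streamed / NULLIF(backup_total, 0), 1) AS "% streamed",
       tablespaces_streamed, tablespaces_total
  FROM pg_stat_progress_basebackup;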
Whenever COPY
is running, the
pg_stat_progress_copy
view will contain one row
for each backend that is currently running a COPY
command.
The table below describes the information that will be reported and provides
information about how to interpret it.
Table 28.44. pg_stat_progress_copy
View
Column Type | Description |
---|---|
pid integer | Process ID of backend. |
datid oid | OID of the database to which this backend is connected. |
datname name | Name of the database to which this backend is connected. |
relid oid | OID of the table on which the COPY command is executed. It is set to 0 if copying from a SELECT query. |
command text | The command that is running: COPY FROM or COPY TO. |
type text | The io type that the data is read from or written to: FILE, PROGRAM, PIPE, or CALLBACK. |
bytes_processed bigint | Number of bytes already processed by the COPY command. |
bytes_total bigint | Size of source file for a COPY FROM command in bytes. It is set to 0 if not available. |
tuples_processed bigint | Number of tuples already processed by the COPY command. |
tuples_excluded bigint | Number of tuples not processed because they were excluded by the WHERE clause of the COPY command. |
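For example, to see the progress of all running COPY commands, a query like the following can be used. This is a sketch only, assuming the column names listed in Table 28.44:
SELECT pid, relid::regclass AS relation, command, type,
       bytes_processed, bytes_total, tuples_processed, tuples_excluded
  FROM pg_stat_progress_copy;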
PostgreSQL provides facilities to support dynamic tracing of the database server. This allows an external utility to be called at specific points in the code and thereby trace execution.
A number of probes or trace points are already inserted into the source code. These probes are intended to be used by database developers and administrators. By default the probes are not compiled into PostgreSQL; the user needs to explicitly tell the configure script to make the probes available.
Currently, the
DTrace
utility is supported, which, at the time of this writing, is available
on Solaris, macOS, FreeBSD, NetBSD, and Oracle Linux. The
SystemTap project
for Linux provides a DTrace equivalent and can also be used. Supporting other dynamic
tracing utilities is theoretically possible by changing the definitions for
the macros in src/include/utils/probes.h
.
By default, probes are not available, so you will need to
explicitly tell the configure script to make the probes available
in PostgreSQL. To include DTrace support
specify --enable-dtrace
to configure. See Section 17.4 for further information.
A number of standard probes are provided in the source code, as shown in Table 28.45; Table 28.46 shows the types used in the probes. More probes can certainly be added to enhance PostgreSQL's observability.
Table 28.45. Built-in DTrace Probes
Name | Parameters | Description |
---|---|---|
transaction-start | (LocalTransactionId) | Probe that fires at the start of a new transaction. arg0 is the transaction ID. |
transaction-commit | (LocalTransactionId) | Probe that fires when a transaction completes successfully. arg0 is the transaction ID. |
transaction-abort | (LocalTransactionId) | Probe that fires when a transaction completes unsuccessfully. arg0 is the transaction ID. |
query-start | (const char *) | Probe that fires when the processing of a query is started. arg0 is the query string. |
query-done | (const char *) | Probe that fires when the processing of a query is complete. arg0 is the query string. |
query-parse-start | (const char *) | Probe that fires when the parsing of a query is started. arg0 is the query string. |
query-parse-done | (const char *) | Probe that fires when the parsing of a query is complete. arg0 is the query string. |
query-rewrite-start | (const char *) | Probe that fires when the rewriting of a query is started. arg0 is the query string. |
query-rewrite-done | (const char *) | Probe that fires when the rewriting of a query is complete. arg0 is the query string. |
query-plan-start | () | Probe that fires when the planning of a query is started. |
query-plan-done | () | Probe that fires when the planning of a query is complete. |
query-execute-start | () | Probe that fires when the execution of a query is started. |
query-execute-done | () | Probe that fires when the execution of a query is complete. |
statement-status | (const char *) | Probe that fires anytime the server process updates its pg_stat_activity.status. arg0 is the new status string. |
checkpoint-start | (int) | Probe that fires when a checkpoint is started. arg0 holds the bitwise flags used to distinguish different checkpoint types, such as shutdown, immediate or force. |
checkpoint-done | (int, int, int, int, int) | Probe that fires when a checkpoint is complete. (The probes listed next fire in sequence during checkpoint processing.) arg0 is the number of buffers written. arg1 is the total number of buffers. arg2, arg3 and arg4 contain the number of WAL files added, removed and recycled respectively. |
clog-checkpoint-start | (bool) | Probe that fires when the CLOG portion of a checkpoint is started. arg0 is true for normal checkpoint, false for shutdown checkpoint. |
clog-checkpoint-done | (bool) | Probe that fires when the CLOG portion of a checkpoint is
complete. arg0 has the same meaning as for clog-checkpoint-start . |
subtrans-checkpoint-start | (bool) | Probe that fires when the SUBTRANS portion of a checkpoint is started. arg0 is true for normal checkpoint, false for shutdown checkpoint. |
subtrans-checkpoint-done | (bool) | Probe that fires when the SUBTRANS portion of a checkpoint is
complete. arg0 has the same meaning as for
subtrans-checkpoint-start . |
multixact-checkpoint-start | (bool) | Probe that fires when the MultiXact portion of a checkpoint is started. arg0 is true for normal checkpoint, false for shutdown checkpoint. |
multixact-checkpoint-done | (bool) | Probe that fires when the MultiXact portion of a checkpoint is
complete. arg0 has the same meaning as for
multixact-checkpoint-start . |
buffer-checkpoint-start | (int) | Probe that fires when the buffer-writing portion of a checkpoint is started. arg0 holds the bitwise flags used to distinguish different checkpoint types, such as shutdown, immediate or force. |
buffer-sync-start | (int, int) | Probe that fires when we begin to write dirty buffers during checkpoint (after identifying which buffers must be written). arg0 is the total number of buffers. arg1 is the number that are currently dirty and need to be written. |
buffer-sync-written | (int) | Probe that fires after each buffer is written during checkpoint. arg0 is the ID number of the buffer. |
buffer-sync-done | (int, int, int) | Probe that fires when all dirty buffers have been written.
arg0 is the total number of buffers.
arg1 is the number of buffers actually written by the checkpoint process.
arg2 is the number that were expected to be written (arg1 of
buffer-sync-start ); any difference reflects other processes flushing
buffers during the checkpoint. |
buffer-checkpoint-sync-start | () | Probe that fires after dirty buffers have been written to the kernel, and before starting to issue fsync requests. |
buffer-checkpoint-done | () | Probe that fires when syncing of buffers to disk is complete. |
twophase-checkpoint-start | () | Probe that fires when the two-phase portion of a checkpoint is started. |
twophase-checkpoint-done | () | Probe that fires when the two-phase portion of a checkpoint is complete. |
buffer-read-start | (ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool) | Probe that fires when a buffer read is started.
arg0 and arg1 contain the fork and block numbers of the page (but
arg1 will be -1 if this is a relation extension request).
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or InvalidBackendId (-1) for a shared buffer.
arg6 is true for a relation extension request, false for normal
read. |
buffer-read-done | (ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool) | Probe that fires when a buffer read is complete.
arg0 and arg1 contain the fork and block numbers of the page (if this
is a relation extension request, arg1 now contains the block number
of the newly added block).
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or InvalidBackendId (-1) for a shared buffer.
arg6 is true for a relation extension request, false for normal
read.
arg7 is true if the buffer was found in the pool, false if not. |
buffer-flush-start | (ForkNumber, BlockNumber, Oid, Oid, Oid) | Probe that fires before issuing any write request for a shared buffer. arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. |
buffer-flush-done | (ForkNumber, BlockNumber, Oid, Oid, Oid) | Probe that fires when a write request is complete. (Note
that this just reflects the time to pass the data to the kernel;
it's typically not actually been written to disk yet.)
The arguments are the same as for buffer-flush-start . |
buffer-write-dirty-start | (ForkNumber, BlockNumber, Oid, Oid, Oid) | Probe that fires when a server process begins to write a dirty buffer. (If this happens often, it implies that shared_buffers is too small or the background writer control parameters need adjustment.) arg0 and arg1 contain the fork and block numbers of the page. arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs identifying the relation. |
buffer-write-dirty-done | (ForkNumber, BlockNumber, Oid, Oid, Oid) | Probe that fires when a dirty-buffer write is complete.
The arguments are the same as for buffer-write-dirty-start . |
wal-buffer-write-dirty-start | () | Probe that fires when a server process begins to write a dirty WAL buffer because no more WAL buffer space is available. (If this happens often, it implies that wal_buffers is too small.) |
wal-buffer-write-dirty-done | () | Probe that fires when a dirty WAL buffer write is complete. |
wal-insert | (unsigned char, unsigned char) | Probe that fires when a WAL record is inserted. arg0 is the resource manager (rmid) for the record. arg1 contains the info flags. |
wal-switch | () | Probe that fires when a WAL segment switch is requested. |
smgr-md-read-start | (ForkNumber, BlockNumber, Oid, Oid, Oid, int) | Probe that fires when beginning to read a block from a relation.
arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or InvalidBackendId (-1) for a shared buffer. |
smgr-md-read-done | (ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int) | Probe that fires when a block read is complete.
arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or InvalidBackendId (-1) for a shared buffer.
arg6 is the number of bytes actually read, while arg7 is the number
requested (if these are different it indicates trouble). |
smgr-md-write-start | (ForkNumber, BlockNumber, Oid, Oid, Oid, int) | Probe that fires when beginning to write a block to a relation.
arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or InvalidBackendId (-1) for a shared buffer. |
smgr-md-write-done | (ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int) | Probe that fires when a block write is complete.
arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or InvalidBackendId (-1) for a shared buffer.
arg6 is the number of bytes actually written, while arg7 is the number
requested (if these are different it indicates trouble). |
sort-start | (int, bool, int, int, bool, int) | Probe that fires when a sort operation is started.
arg0 indicates heap, index or datum sort.
arg1 is true for unique-value enforcement.
arg2 is the number of key columns.
arg3 is the number of kilobytes of work memory allowed.
arg4 is true if random access to the sort result is required.
arg5 indicates serial when 0 , parallel worker when
1 , or parallel leader when 2 . |
sort-done | (bool, long) | Probe that fires when a sort is complete. arg0 is true for external sort, false for internal sort. arg1 is the number of disk blocks used for an external sort, or kilobytes of memory used for an internal sort. |
lwlock-acquire | (char *, LWLockMode) | Probe that fires when an LWLock has been acquired. arg0 is the LWLock's tranche. arg1 is the requested lock mode, either exclusive or shared. |
lwlock-release | (char *) | Probe that fires when an LWLock has been released (but note that any released waiters have not yet been awakened). arg0 is the LWLock's tranche. |
lwlock-wait-start | (char *, LWLockMode) | Probe that fires when an LWLock was not immediately available and a server process has begun to wait for the lock to become available. arg0 is the LWLock's tranche. arg1 is the requested lock mode, either exclusive or shared. |
lwlock-wait-done | (char *, LWLockMode) | Probe that fires when a server process has been released from its wait for an LWLock (it does not actually have the lock yet). arg0 is the LWLock's tranche. arg1 is the requested lock mode, either exclusive or shared. |
lwlock-condacquire | (char *, LWLockMode) | Probe that fires when an LWLock was successfully acquired when the caller specified no waiting. arg0 is the LWLock's tranche. arg1 is the requested lock mode, either exclusive or shared. |
lwlock-condacquire-fail | (char *, LWLockMode) | Probe that fires when an LWLock was not successfully acquired when the caller specified no waiting. arg0 is the LWLock's tranche. arg1 is the requested lock mode, either exclusive or shared. |
lock-wait-start | (unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE) | Probe that fires when a request for a heavyweight lock (lmgr lock) has begun to wait because the lock is not available. arg0 through arg3 are the tag fields identifying the object being locked. arg4 indicates the type of object being locked. arg5 indicates the lock type being requested. |
lock-wait-done | (unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, LOCKMODE) | Probe that fires when a request for a heavyweight lock (lmgr lock)
has finished waiting (i.e., has acquired the lock).
The arguments are the same as for lock-wait-start . |
deadlock-found | () | Probe that fires when a deadlock is found by the deadlock detector. |
Table 28.46. Defined Types Used in Probe Parameters
Type | Definition |
---|---|
LocalTransactionId | unsigned int |
LWLockMode | int |
LOCKMODE | int |
BlockNumber | unsigned int |
Oid | unsigned int |
ForkNumber | int |
bool | unsigned char |
The example below shows a DTrace script for analyzing transaction
counts in the system, as an alternative to snapshotting
pg_stat_database
before and after a performance test:
#!/usr/sbin/dtrace -qs

postgresql$1:::transaction-start
{
      @start["Start"] = count();
      self->ts = timestamp;
}

postgresql$1:::transaction-abort
{
      @abort["Abort"] = count();
}

postgresql$1:::transaction-commit
/self->ts/
{
      @commit["Commit"] = count();
      @time["Total time (ns)"] = sum(timestamp - self->ts);
      self->ts=0;
}
When executed, the example D script gives output such as:
# ./txn_count.d `pgrep -n postgres` or ./txn_count.d <PID>
^C

Start                                          71
Commit                                         70
Total time (ns)                        2312105013
SystemTap uses a different notation for trace scripts than DTrace does, even though the underlying trace points are compatible. One point worth noting is that at this writing, SystemTap scripts must reference probe names using double underscores in place of hyphens. This is expected to be fixed in future SystemTap releases.
You should remember that DTrace scripts need to be carefully written and debugged, otherwise the trace information collected might be meaningless. In most cases where problems are found it is the instrumentation that is at fault, not the underlying system. When discussing information found using dynamic tracing, be sure to enclose the script used to allow that too to be checked and discussed.
New probes can be defined within the code wherever the developer desires, though this will require a recompilation. Below are the steps for inserting new probes:
Decide on probe names and data to be made available through the probes
Add the probe definitions to src/backend/utils/probes.d
Include pg_trace.h
if it is not already present in the
module(s) containing the probe points, and insert
TRACE_POSTGRESQL
probe macros at the desired locations
in the source code
Recompile and verify that the new probes are available
Example: Here is an example of how you would add a probe to trace all new transactions by transaction ID.
Decide that the probe will be named transaction-start
and
requires a parameter of type LocalTransactionId
Add the probe definition to src/backend/utils/probes.d
:
probe transaction__start(LocalTransactionId);
Note the use of the double underline in the probe name. In a DTrace
script using the probe, the double underline needs to be replaced with a
hyphen, so transaction-start
is the name to document for
users.
At compile time, transaction__start
is converted to a macro
called TRACE_POSTGRESQL_TRANSACTION_START
(notice the
underscores are single here), which is available by including
pg_trace.h
. Add the macro call to the appropriate location
in the source code. In this case, it looks like the following:
TRACE_POSTGRESQL_TRANSACTION_START(vxid.localTransactionId);
After recompiling and running the new binary, check that your newly added probe is available by executing the following DTrace command. You should see similar output:
# dtrace -ln transaction-start
   ID    PROVIDER            MODULE                    FUNCTION NAME
18705 postgresql49878       postgres   StartTransactionCommand transaction-start
18755 postgresql49877       postgres   StartTransactionCommand transaction-start
18805 postgresql49876       postgres   StartTransactionCommand transaction-start
18855 postgresql49875       postgres   StartTransactionCommand transaction-start
18986 postgresql49873       postgres   StartTransactionCommand transaction-start
There are a few things to be careful about when adding trace macros to the C code:
You should take care that the data types specified for a probe's parameters match the data types of the variables used in the macro. Otherwise, you will get compilation errors.
On most platforms, if PostgreSQL is
built with --enable-dtrace
, the arguments to a trace
macro will be evaluated whenever control passes through the
macro, even if no tracing is being done. This is
usually not worth worrying about if you are just reporting the
values of a few local variables. But beware of putting expensive
function calls into the arguments. If you need to do that,
consider protecting the macro with a check to see if the trace
is actually enabled:
if (TRACE_POSTGRESQL_TRANSACTION_START_ENABLED())
    TRACE_POSTGRESQL_TRANSACTION_START(some_function(...));
Each trace macro has a corresponding ENABLED
macro.
Table of Contents
This chapter discusses how to monitor the disk usage of a PostgreSQL database system.
Each table has a primary heap disk file where most of the data is stored. If the table has any columns with potentially-wide values, there also might be a TOAST file associated with the table, which is used to store values too wide to fit comfortably in the main table (see Section 70.2). There will be one valid index on the TOAST table, if present. There also might be indexes associated with the base table. Each table and index is stored in a separate disk file — possibly more than one file, if the file would exceed one gigabyte. Naming conventions for these files are described in Section 70.1.
You can monitor disk space in three ways: using the SQL functions listed in Table 9.92, using the oid2name module, or using manual inspection of the system catalogs. The SQL functions are the easiest to use and are generally recommended. The remainder of this section shows how to do it by inspection of the system catalogs.
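For example, the size functions can summarize a table and its indexes in a single query. This is only a sketch; 'customer' stands in for whatever table you want to examine:
SELECT pg_size_pretty(pg_total_relation_size('customer')) AS total_size,
       pg_size_pretty(pg_relation_size('customer'))       AS heap_size,
       pg_size_pretty(pg_indexes_size('customer'))        AS index_size;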
Using psql on a recently vacuumed or analyzed database, you can issue queries to see the disk usage of any table:
SELECT pg_relation_filepath(oid), relpages
FROM pg_class WHERE relname = 'customer';

 pg_relation_filepath | relpages
----------------------+----------
 base/16384/16806     |       60
(1 row)
Each page is typically 8 kilobytes. (Remember, relpages
is only updated by VACUUM
, ANALYZE
, and
a few DDL commands such as CREATE INDEX
.) The file path name
is of interest if you want to examine the table's disk file directly.
To show the space used by TOAST tables, use a query like the following:
SELECT relname, relpages
FROM pg_class,
     (SELECT reltoastrelid
      FROM pg_class
      WHERE relname = 'customer') AS ss
WHERE oid = ss.reltoastrelid OR
      oid = (SELECT indexrelid
             FROM pg_index
             WHERE indrelid = ss.reltoastrelid)
ORDER BY relname;

       relname        | relpages
----------------------+----------
 pg_toast_16806       |        0
 pg_toast_16806_index |        1
You can easily display index sizes, too:
SELECT c2.relname, c2.relpages
FROM pg_class c, pg_class c2, pg_index i
WHERE c.relname = 'customer' AND
      c.oid = i.indrelid AND
      c2.oid = i.indexrelid
ORDER BY c2.relname;

      relname      | relpages
-------------------+----------
 customer_id_index |       26
It is easy to find your largest tables and indexes using this information:
SELECT relname, relpages
FROM pg_class
ORDER BY relpages DESC;

       relname        | relpages
----------------------+----------
 bigtable             |     3290
 customer             |     3144
The most important disk monitoring task of a database administrator is to make sure the disk doesn't become full. A filled data disk will not result in data corruption, but it might prevent useful activity from occurring. If the disk holding the WAL files grows full, database server panic and consequent shutdown might occur.
If you cannot free up additional space on the disk by deleting other things, you can move some of the database files to other file systems by making use of tablespaces. See Section 23.6 for more information about that.
Some file systems perform badly when they are almost full, so do not wait until the disk is completely full to take action.
If your system supports per-user disk quotas, then the database will naturally be subject to whatever quota is placed on the user the server runs as. Exceeding the quota will have the same bad effects as running out of disk space entirely.
Table of Contents
This chapter explains how the Write-Ahead Log is used to obtain efficient, reliable operation.
Reliability is an important property of any serious database system, and PostgreSQL does everything possible to guarantee reliable operation. One aspect of reliable operation is that all data recorded by a committed transaction should be stored in a nonvolatile area that is safe from power loss, operating system failure, and hardware failure (except failure of the nonvolatile area itself, of course). Successfully writing the data to the computer's permanent storage (disk drive or equivalent) ordinarily meets this requirement. In fact, even if a computer is fatally damaged, if the disk drives survive they can be moved to another computer with similar hardware and all committed transactions will remain intact.
While forcing data to the disk platters periodically might seem like a simple operation, it is not. Because disk drives are dramatically slower than main memory and CPUs, several layers of caching exist between the computer's main memory and the disk platters. First, there is the operating system's buffer cache, which caches frequently requested disk blocks and combines disk writes. Fortunately, all operating systems give applications a way to force writes from the buffer cache to disk, and PostgreSQL uses those features. (See the wal_sync_method parameter to adjust how this is done.)
Next, there might be a cache in the disk drive controller; this is particularly common on RAID controller cards. Some of these caches are write-through, meaning writes are sent to the drive as soon as they arrive. Others are write-back, meaning data is sent to the drive at some later time. Such caches can be a reliability hazard because the memory in the disk controller cache is volatile, and will lose its contents in a power failure. Better controller cards have battery-backup units (BBUs), meaning the card has a battery that maintains power to the cache in case of system power loss. After power is restored the data will be written to the disk drives.
And finally, most disk drives have caches. Some are write-through while some are write-back, and the same concerns about data loss exist for write-back drive caches as for disk controller caches. Consumer-grade IDE and SATA drives are particularly likely to have write-back caches that will not survive a power failure. Many solid-state drives (SSD) also have volatile write-back caches.
These caches can typically be disabled; however, the method for doing this varies by operating system and drive type:
On Linux, IDE and SATA drives can be queried using
hdparm -I
; write caching is enabled if there is
a *
next to Write cache
. hdparm -W 0
can be used to turn off write caching. SCSI drives can be queried
using sdparm.
Use sdparm --get=WCE
to check
whether the write cache is enabled and sdparm --clear=WCE
to disable it.
On FreeBSD, IDE drives can be queried using
camcontrol identify
and write caching turned off using
hw.ata.wc=0
in /boot/loader.conf
;
SCSI drives can be queried using camcontrol identify
,
and the write cache both queried and changed using
sdparm
when available.
On Solaris, the disk write cache is controlled by
format -e
.
(The Solaris ZFS file system is safe with disk write-cache
enabled because it issues its own disk cache flush commands.)
On Windows, if wal_sync_method is
open_datasync (the default), write caching can be disabled
by unchecking My Computer\Open\disk drive\Properties\Hardware\Properties\Policies\Enable write caching on the disk.
Alternatively, set wal_sync_method to
fsync or fsync_writethrough, which prevent
write caching.
On macOS, write caching can be prevented by
setting wal_sync_method
to fsync_writethrough
.
Recent SATA drives (those following ATAPI-6 or later)
offer a drive cache flush command (FLUSH CACHE EXT
),
while SCSI drives have long supported a similar command
SYNCHRONIZE CACHE
. These commands are not directly
accessible to PostgreSQL, but some file systems
(e.g., ZFS, ext4) can use them to flush
data to the platters on write-back-enabled drives. Unfortunately, such
file systems behave suboptimally when combined with battery-backup unit
(BBU) disk controllers. In such setups, the synchronize
command forces all data from the controller cache to the disks,
eliminating much of the benefit of the BBU. You can run the
pg_test_fsync program to see
if you are affected. If you are affected, the performance benefits
of the BBU can be regained by turning off write barriers in
the file system or reconfiguring the disk controller, if that is
an option. If write barriers are turned off, make sure the battery
remains functional; a faulty battery can potentially lead to data loss.
Hopefully file system and disk controller designers will eventually
address this suboptimal behavior.
When the operating system sends a write request to the storage hardware,
there is little it can do to make sure the data has arrived at a truly
non-volatile storage area. Rather, it is the
administrator's responsibility to make certain that all storage components
ensure integrity for both data and file-system metadata.
Avoid disk controllers that have non-battery-backed write caches.
At the drive level, disable write-back caching if the
drive cannot guarantee the data will be written before shutdown.
If you use SSDs, be aware that many of these do not honor cache flush
commands by default.
You can test for reliable I/O subsystem behavior using diskchecker.pl
.
Another risk of data loss is posed by the disk platter write operations themselves. Disk platters are divided into sectors, commonly 512 bytes each. Every physical read or write operation processes a whole sector. When a write request arrives at the drive, it might be for some multiple of 512 bytes (PostgreSQL typically writes 8192 bytes, or 16 sectors, at a time), and the process of writing could fail due to power loss at any time, meaning some of the 512-byte sectors were written while others were not. To guard against such failures, PostgreSQL periodically writes full page images to permanent WAL storage before modifying the actual page on disk. By doing this, during crash recovery PostgreSQL can restore partially-written pages from WAL. If you have file-system software that prevents partial page writes (e.g., ZFS), you can turn off this page imaging by turning off the full_page_writes parameter. Battery-Backed Unit (BBU) disk controllers do not prevent partial page writes unless they guarantee that data is written to the BBU as full (8kB) pages.
PostgreSQL also protects against some kinds of data corruption on storage devices that may occur because of hardware errors or media failure over time, such as reading/writing garbage data.
Each individual record in a WAL file is protected by a CRC-32 (32-bit) check that allows us to tell if record contents are correct. The CRC value is set when we write each WAL record and checked during crash recovery, archive recovery and replication.
Data pages are not currently checksummed by default, though full page images recorded in WAL records will be protected; see initdb for details about enabling data checksums.
Internal data structures such as pg_xact
, pg_subtrans
, pg_multixact
,
pg_serial
, pg_notify
, pg_stat
, pg_snapshots
are not directly
checksummed, nor are pages protected by full page writes. However, where
such data structures are persistent, WAL records are written that allow
recent changes to be accurately rebuilt at crash recovery and those
WAL records are protected as discussed above.
Individual state files in pg_twophase
are protected by CRC-32.
Temporary data files used in larger SQL queries for sorts, materializations and intermediate results are not currently checksummed, nor will WAL records be written for changes to those files.
PostgreSQL does not protect against correctable memory errors and it is assumed you will operate using RAM that uses industry standard Error Correcting Codes (ECC) or better protection.
By default, data pages are not protected by checksums, but this can optionally be enabled for a cluster. When enabled, each data page includes a checksum that is updated when the page is written and verified each time the page is read. Only data pages are protected by checksums; internal data structures and temporary files are not.
Checksums are normally enabled when the cluster is initialized using initdb. They can also be enabled or disabled at a later time as an offline operation. Data checksums are enabled or disabled at the full cluster level, and cannot be specified individually for databases or tables.
The current state of checksums in the cluster can be verified by viewing the
value of the read-only configuration variable data_checksums, for example by issuing
the command SHOW data_checksums.
When attempting to recover from page corruptions, it may be necessary to bypass the checksum protection. To do this, temporarily set the configuration parameter ignore_checksum_failure.
The pg_checksums application can be used to enable or disable data checksums, as well as verify checksums, on an offline cluster.
Write-Ahead Logging (WAL) is a standard method for ensuring data integrity. A detailed description can be found in most (if not all) books about transaction processing. Briefly, WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage. If we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log: any changes that have not been applied to the data pages can be redone from the log records. (This is roll-forward recovery, also known as REDO.)
Because WAL restores database file
contents after a crash, journaled file systems are not necessary for
reliable storage of the data files or WAL files. In fact, journaling
overhead can reduce performance, especially if journaling
causes file system data to be flushed
to disk. Fortunately, data flushing during journaling can
often be disabled with a file system mount option, e.g.,
data=writeback
on a Linux ext3 file system.
Journaled file systems do improve boot speed after a crash.
Using WAL results in a
significantly reduced number of disk writes, because only the log
file needs to be flushed to disk to guarantee that a transaction is
committed, rather than every data file changed by the transaction.
The log file is written sequentially,
and so the cost of syncing the log is much less than the cost of
flushing the data pages. This is especially true for servers
handling many small transactions touching different parts of the data
store. Furthermore, when the server is processing many small concurrent
transactions, one fsync
of the log file may
suffice to commit many transactions.
WAL also makes it possible to support on-line backup and point-in-time recovery, as described in Section 26.3. By archiving the WAL data we can support reverting to any time instant covered by the available WAL data: we simply install a prior physical backup of the database, and replay the WAL log just as far as the desired time. What's more, the physical backup doesn't have to be an instantaneous snapshot of the database state — if it is made over some period of time, then replaying the WAL log for that period will fix any internal inconsistencies.
Asynchronous commit is an option that allows transactions to complete more quickly, at the cost that the most recent transactions may be lost if the database should crash. In many applications this is an acceptable trade-off.
As described in the previous section, transaction commit is normally synchronous: the server waits for the transaction's WAL records to be flushed to permanent storage before returning a success indication to the client. The client is therefore guaranteed that a transaction reported to be committed will be preserved, even in the event of a server crash immediately after. However, for short transactions this delay is a major component of the total transaction time. Selecting asynchronous commit mode means that the server returns success as soon as the transaction is logically completed, before the WAL records it generated have actually made their way to disk. This can provide a significant boost in throughput for small transactions.
Asynchronous commit introduces the risk of data loss. There is a short time window between the report of transaction completion to the client and the time that the transaction is truly committed (that is, it is guaranteed not to be lost if the server crashes). Thus asynchronous commit should not be used if the client will take external actions relying on the assumption that the transaction will be remembered. As an example, a bank would certainly not use asynchronous commit for a transaction recording an ATM's dispensing of cash. But in many scenarios, such as event logging, there is no need for a strong guarantee of this kind.
The risk that is taken by using asynchronous commit is of data loss, not data corruption. If the database should crash, it will recover by replaying WAL up to the last record that was flushed. The database will therefore be restored to a self-consistent state, but any transactions that were not yet flushed to disk will not be reflected in that state. The net effect is therefore loss of the last few transactions. Because the transactions are replayed in commit order, no inconsistency can be introduced — for example, if transaction B made changes relying on the effects of a previous transaction A, it is not possible for A's effects to be lost while B's effects are preserved.
The user can select the commit mode of each transaction, so that
it is possible to have both synchronous and asynchronous commit
transactions running concurrently. This allows flexible trade-offs
between performance and certainty of transaction durability.
The commit mode is controlled by the user-settable parameter
synchronous_commit, which can be changed in any of
the ways that a configuration parameter can be set. The mode used for
any one transaction depends on the value of
synchronous_commit
when transaction commit begins.
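For example, a transaction that is content with asynchronous durability can opt in locally without changing the server-wide default. This is only a sketch; event_log is a hypothetical table, and SET LOCAL confines the change to the current transaction:
BEGIN;
SET LOCAL synchronous_commit TO off;
-- event_log is a hypothetical table; losing this row in a crash would be tolerable
INSERT INTO event_log (message) VALUES ('page viewed');
COMMIT;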
Certain utility commands, for instance DROP TABLE
, are
forced to commit synchronously regardless of the setting of
synchronous_commit
. This is to ensure consistency
between the server's file system and the logical state of the database.
The commands supporting two-phase commit, such as PREPARE
TRANSACTION
, are also always synchronous.
If the database crashes during the risk window between an
asynchronous commit and the writing of the transaction's
WAL records,
then changes made during that transaction will be lost.
The duration of the
risk window is limited because a background process (the “WAL
writer”) flushes unwritten WAL records to disk
every wal_writer_delay milliseconds.
The actual maximum duration of the risk window is three times
wal_writer_delay
because the WAL writer is
designed to favor writing whole pages at a time during busy periods.
An immediate-mode shutdown is equivalent to a server crash, and will therefore cause loss of any unflushed asynchronous commits.
Asynchronous commit provides behavior different from setting
fsync = off.
fsync
is a server-wide
setting that will alter the behavior of all transactions. It disables
all logic within PostgreSQL that attempts to synchronize
writes to different portions of the database, and therefore a system
crash (that is, a hardware or operating system crash, not a failure of
PostgreSQL itself) could result in arbitrarily bad
corruption of the database state. In many scenarios, asynchronous
commit provides most of the performance improvement that could be
obtained by turning off fsync
, but without the risk
of data corruption.
commit_delay also sounds very similar to
asynchronous commit, but it is actually a synchronous commit method
(in fact, commit_delay
is ignored during an
asynchronous commit). commit_delay
causes a delay
just before a transaction flushes WAL to disk, in
the hope that a single flush executed by one such transaction can also
serve other transactions committing at about the same time. The
setting can be thought of as a way of increasing the time window in
which transactions can join a group about to participate in a single
flush, to amortize the cost of the flush among multiple transactions.
There are several WAL-related configuration parameters that affect database performance. This section explains their use. Consult Chapter 20 for general information about setting server configuration parameters.
Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. (The change records were previously flushed to the WAL files.) In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the log (known as the redo record) from which it should start the REDO operation. Any changes made to data files before that point are guaranteed to be already on disk. Hence, after a checkpoint, log segments preceding the one containing the redo record are no longer needed and can be recycled or removed. (When WAL archiving is being done, the log segments must be archived before being recycled or removed.)
The checkpoint requirement of flushing all dirty data pages to disk can cause a significant I/O load. For this reason, checkpoint activity is throttled so that I/O begins at checkpoint start and completes before the next checkpoint is due to start; this minimizes performance degradation during checkpoints.
The server's checkpointer process automatically performs
a checkpoint every so often. A checkpoint is begun every checkpoint_timeout seconds, or if
max_wal_size is about to be exceeded,
whichever comes first.
The default settings are 5 minutes and 1 GB, respectively.
If no WAL has been written since the previous checkpoint, new checkpoints
will be skipped even if checkpoint_timeout
has passed.
(If WAL archiving is being used and you want to put a lower limit on how
often files are archived in order to bound potential data loss, you should
adjust the archive_timeout parameter rather than the
checkpoint parameters.)
It is also possible to force a checkpoint by using the SQL
command CHECKPOINT
.
Reducing checkpoint_timeout
and/or
max_wal_size
causes checkpoints to occur
more often. This allows faster after-crash recovery, since less work
will need to be redone. However, one must balance this against the
increased cost of flushing dirty data pages more often. If
full_page_writes is set (as is the default), there is
another factor to consider. To ensure data page consistency,
the first modification of a data page after each checkpoint results in
logging the entire page content. In that case,
a smaller checkpoint interval increases the volume of output to the WAL log,
partially negating the goal of using a smaller interval,
and in any case causing more disk I/O.
Checkpoints are fairly expensive, first because they require writing
out all currently dirty buffers, and second because they result in
extra subsequent WAL traffic as discussed above. It is therefore
wise to set the checkpointing parameters high enough so that checkpoints
don't happen too often. As a simple sanity check on your checkpointing
parameters, you can set the checkpoint_warning
parameter. If checkpoints happen closer together than
checkpoint_warning
seconds,
a message will be output to the server log recommending increasing
max_wal_size
. Occasional appearance of such
a message is not cause for alarm, but if it appears often then the
checkpoint control parameters should be increased. Bulk operations such
as large COPY
transfers might cause a number of such warnings
to appear if you have not set max_wal_size
high
enough.
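A related rough check is to compare how many checkpoints were triggered by the timeout versus by WAL volume, which can be read from the pg_stat_bgwriter view. This is only a sketch; a high proportion of requested checkpoints relative to timed ones suggests max_wal_size may be set too low:
SELECT checkpoints_timed, checkpoints_req
  FROM pg_stat_bgwriter;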
To avoid flooding the I/O system with a burst of page writes,
writing dirty buffers during a checkpoint is spread over a period of time.
That period is controlled by
checkpoint_completion_target, which is
given as a fraction of the checkpoint interval (configured by using
checkpoint_timeout
).
The I/O rate is adjusted so that the checkpoint finishes when the
given fraction of
checkpoint_timeout
seconds have elapsed, or before
max_wal_size
is exceeded, whichever is sooner.
With the default value of 0.9,
PostgreSQL can be expected to complete each checkpoint
a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's
duration). This spreads out the I/O as much as possible so that the checkpoint
I/O load is consistent throughout the checkpoint interval. The disadvantage of
this is that prolonging checkpoints affects recovery time, because more WAL
segments will need to be kept around for possible use in recovery. A user
concerned about the amount of time required to recover might wish to reduce
checkpoint_timeout
so that checkpoints occur more frequently
but still spread the I/O across the checkpoint interval. Alternatively,
checkpoint_completion_target
could be reduced, but this would
result in times of more intense I/O (during the checkpoint) and times of less I/O
(after the checkpoint completed but before the next scheduled checkpoint) and
therefore is not recommended.
Although checkpoint_completion_target
could be set as high as
1.0, it is typically recommended to set it to no higher than 0.9 (the default)
since checkpoints include some other activities besides writing dirty buffers.
A setting of 1.0 is quite likely to result in checkpoints not being
completed on time, which would result in performance loss due to
unexpected variation in the number of WAL segments needed.
On Linux and POSIX platforms, checkpoint_flush_after allows you to force the OS to flush pages written by the checkpoint to disk after a configurable number of bytes. Otherwise, these
pages may be kept in the OS's page cache, inducing a stall when
fsync
is issued at the end of a checkpoint. This setting will
often help to reduce transaction latency, but it can also have an adverse
effect on performance, particularly for workloads that are bigger than
shared_buffers but smaller than the OS's page cache.
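For reference, the two parameters could be stated explicitly in postgresql.conf as follows (this is only a sketch showing the default values discussed above):

checkpoint_completion_target = 0.9    # spread checkpoint I/O over 90% of the interval
checkpoint_flush_after = 256kB        # Linux default; 0 disables forced flushing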
The number of WAL segment files in the pg_wal directory depends on min_wal_size, max_wal_size and
the amount of WAL generated in previous checkpoint cycles. When old log
segment files are no longer needed, they are removed or recycled (that is,
renamed to become future segments in the numbered sequence). If, due to a
short-term peak of log output rate, max_wal_size
is
exceeded, the unneeded segment files will be removed until the system
gets back under this limit. Below that limit, the system recycles enough
WAL files to cover the estimated need until the next checkpoint, and
removes the rest. The estimate is based on a moving average of the number
of WAL files used in previous checkpoint cycles. The moving average
is increased immediately if the actual usage exceeds the estimate, so it
accommodates peak usage rather than average usage to some extent.
min_wal_size
puts a minimum on the amount of WAL files
recycled for future usage; that much WAL is always recycled for future use,
even if the system is idle and the WAL usage estimate suggests that little
WAL is needed.
Independently of max_wal_size,
the most recent wal_keep_size megabytes of
WAL files plus one additional WAL file are
kept at all times. Also, if WAL archiving is used, old segments cannot be
removed or recycled until they are archived. If WAL archiving cannot keep up
with the pace that WAL is generated, or if archive_command
fails repeatedly, old WAL files will accumulate in pg_wal
until the situation is resolved. A slow or failed standby server that
uses a replication slot will have the same effect (see
Section 27.2.6).
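As a hedged sketch, an installation that wants to keep extra WAL on hand for recycling and for standbys might combine these parameters as follows (illustrative values only, not recommendations):

min_wal_size = 1GB          # never recycle below this amount
max_wal_size = 4GB          # soft limit that triggers checkpoints
wal_keep_size = 2GB         # retain recent WAL for standbys that do not use slots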
In archive recovery or standby mode, the server periodically performs
restartpoints,
which are similar to checkpoints in normal operation: the server forces
all its state to disk, updates the pg_control
file to
indicate that the already-processed WAL data need not be scanned again,
and then recycles any old log segment files in the pg_wal
directory.
Restartpoints can't be performed more frequently than checkpoints on the
primary because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at
least checkpoint_timeout
seconds have passed since the last
restartpoint, or if WAL size is about to exceed max_wal_size. However, because of limitations on when a
restartpoint can be performed, max_wal_size
is often exceeded
during recovery, by up to one checkpoint cycle's worth of WAL.
(max_wal_size
is never a hard limit anyway, so you should
always leave plenty of headroom to avoid running out of disk space.)
There are two commonly used internal WAL functions: XLogInsertRecord and XLogFlush.
XLogInsertRecord
is used to place a new record into
the WAL buffers in shared memory. If there is no
space for the new record, XLogInsertRecord
will have
to write (move to kernel cache) a few filled WAL
buffers. This is undesirable because XLogInsertRecord
is used on every database low level modification (for example, row
insertion) at a time when an exclusive lock is held on affected
data pages, so the operation needs to be as fast as possible. What
is worse, writing WAL buffers might also force the
creation of a new log segment, which takes even more
time. Normally, WAL buffers should be written
and flushed by an XLogFlush
request, which is
made, for the most part, at transaction commit time to ensure that
transaction records are flushed to permanent storage. On systems
with high log output, XLogFlush
requests might
not occur often enough to prevent XLogInsertRecord
from having to do writes. On such systems
one should increase the number of WAL buffers by
modifying the wal_buffers parameter. When
full_page_writes is set and the system is very busy,
setting wal_buffers
higher will help smooth response times
during the period immediately following each checkpoint.
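A minimal sketch of raising wal_buffers on a busy system (the value is illustrative; the default of -1 sizes the buffers automatically at 1/32 of shared_buffers, up to the size of one WAL segment, normally 16MB):

wal_buffers = 64MB          # requires a server restart to change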
The commit_delay parameter defines for how many
microseconds a group commit leader process will sleep after acquiring a
lock within XLogFlush, while group commit
followers queue up behind the leader. This delay allows other server
processes to add their commit records to the WAL buffers so that all of
them will be flushed by the leader's eventual sync operation. No sleep
will occur if fsync is not enabled, or if fewer
than commit_siblings other sessions are currently
in active transactions; this avoids sleeping when it's unlikely that
any other session will commit soon. Note that on some platforms, the
resolution of a sleep request is ten milliseconds, so that any nonzero
commit_delay
setting between 1 and 10000
microseconds would have the same effect. Note also that on some
platforms, sleep operations may take slightly longer than requested by
the parameter.
Since the purpose of commit_delay
is to allow the
cost of each flush operation to be amortized across concurrently
committing transactions (potentially at the expense of transaction
latency), it is necessary to quantify that cost before the setting can
be chosen intelligently. The higher that cost is, the more effective
commit_delay
is expected to be in increasing
transaction throughput, up to a point. The pg_test_fsync program can be used to measure the average time
in microseconds that a single WAL flush operation takes. A value of
half of the average time the program reports it takes to flush after a
single 8kB write operation is often the most effective setting for
commit_delay, so this value is recommended as the
starting point to use when optimizing for a particular workload. While
tuning commit_delay
is particularly useful when the
WAL log is stored on high-latency rotating disks, benefits can be
significant even on storage media with very fast sync times, such as
solid-state drives or RAID arrays with a battery-backed write cache;
but this should definitely be tested against a representative workload.
Higher values of commit_siblings
should be used in
such cases, whereas smaller commit_siblings
values
are often helpful on higher latency media. Note that it is quite
possible that a setting of commit_delay
that is too
high can increase transaction latency by so much that total transaction
throughput suffers.
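For instance, assuming pg_test_fsync reports roughly 300 microseconds per flush for a single 8kB write on the WAL disk (a purely hypothetical figure, and the file path below is likewise hypothetical), a starting point following the rule above could be:

pg_test_fsync -f /path/on/wal/disk/testfile
# then, in postgresql.conf:
commit_delay = 150          # about half the measured flush time, in microseconds
commit_siblings = 5         # the default; consider raising it on fast storage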
When commit_delay
is set to zero (the default), it
is still possible for a form of group commit to occur, but each group
will consist only of sessions that reach the point where they need to
flush their commit records during the window in which the previous
flush operation (if any) is occurring. At higher client counts a
“gangway effect” tends to occur, so that the effects of group
commit become significant even when commit_delay
is
zero, and thus explicitly setting commit_delay
tends
to help less. Setting commit_delay
can only help
when (1) there are some concurrently committing transactions, and (2)
throughput is limited to some degree by commit rate; but with high
rotational latency this setting can be effective in increasing
transaction throughput with as few as two clients (that is, a single
committing client with one sibling transaction).
The wal_sync_method parameter determines how
PostgreSQL will ask the kernel to force
WAL updates out to disk.
All the options should be the same in terms of reliability, with
the exception of fsync_writethrough, which can sometimes
force a flush of the disk cache even when other options do not do so.
However, it's quite platform-specific which one will be the fastest.
You can test the speeds of different options using the pg_test_fsync program.
Note that this parameter is irrelevant if fsync
has been turned off.
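A hedged example of comparing and then choosing a sync method (the fastest method varies by platform, so treat the value chosen below as an assumption to be verified with pg_test_fsync on your own WAL disk):

pg_test_fsync
# if, say, fdatasync turns out fastest on this platform, set in postgresql.conf:
wal_sync_method = fdatasync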
Enabling the wal_debug configuration parameter
(provided that PostgreSQL has been
compiled with support for it) will result in each
XLogInsertRecord
and XLogFlush
WAL call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
There are two internal functions to write WAL data to disk: XLogWrite and issue_xlog_fsync.
When track_wal_io_timing is enabled, the total
amounts of time XLogWrite
writes and
issue_xlog_fsync
syncs WAL data to disk are counted as
wal_write_time
and wal_sync_time
in
pg_stat_wal, respectively.
XLogWrite
is normally called by
XLogInsertRecord
(when there is no space for the new
record in WAL buffers), XLogFlush
and the WAL writer,
to write WAL buffers to disk and call issue_xlog_fsync.
issue_xlog_fsync
is normally called by
XLogWrite
to sync WAL files to disk.
If wal_sync_method is either open_datasync or open_sync,
a write operation in XLogWrite
guarantees to sync written
WAL data to disk and issue_xlog_fsync
does nothing.
If wal_sync_method is either fdatasync, fsync, or fsync_writethrough,
the write operation moves WAL buffers to kernel cache and
issue_xlog_fsync
syncs them to disk. Regardless
of the setting of track_wal_io_timing, the number
of times XLogWrite
writes and
issue_xlog_fsync
syncs WAL data to disk are also
counted as wal_write
and wal_sync
in pg_stat_wal, respectively.
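For example, the write and sync counters (and, with track_wal_io_timing enabled, the corresponding timings) can be inspected with a simple query against the pg_stat_wal view:

SELECT wal_write, wal_sync, wal_write_time, wal_sync_time FROM pg_stat_wal;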
WAL is automatically enabled; no action is required from the administrator except ensuring that the disk-space requirements for the WAL logs are met, and that any necessary tuning is done (see Section 30.5).
WAL records are appended to the WAL
logs as each new record is written. The insert position is described by
a Log Sequence Number (LSN) that is a byte offset into
the logs, increasing monotonically with each new record.
LSN values are returned as the datatype pg_lsn. Values can be
compared to calculate the volume of WAL data that
separates them, so they are used to measure the progress of replication
and recovery.
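For example, the current insert position, and the volume of WAL separating it from an older LSN (the literal value below is hypothetical), can be computed with the pg_lsn-aware functions:

SELECT pg_current_wal_lsn();
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/15D690D8') AS bytes_of_wal;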
WAL logs are stored in the directory
pg_wal
under the data directory, as a set of
segment files, normally each 16 MB in size (but the size can be changed
by altering the --wal-segsize
initdb option). Each segment is
divided into pages, normally 8 kB each (this size can be changed via the
--with-wal-blocksize
configure option). The log record headers
are described in access/xlogrecord.h; the record
content is dependent on the type of event that is being logged. Segment
files are given ever-increasing numbers as names, starting at
000000010000000000000001. The numbers do not wrap,
but it will take a very, very long time to exhaust the
available stock of numbers.
It is advantageous if the log is located on a different disk from the
main database files. This can be achieved by moving the
pg_wal
directory to another location (while the server
is shut down, of course) and creating a symbolic link from the
original location in the main data directory to the new location.
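A minimal sketch of that procedure, assuming a hypothetical mount point /fastdisk and a PGDATA environment variable pointing at the data directory:

pg_ctl stop
mv $PGDATA/pg_wal /fastdisk/pg_wal
ln -s /fastdisk/pg_wal $PGDATA/pg_wal
pg_ctl start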
The aim of WAL is to ensure that the log is written before database records are altered, but this can be subverted by disk drives that falsely report a successful write to the kernel, when in fact they have only cached the data and not yet stored it on the disk. A power failure in such a situation might lead to irrecoverable data corruption. Administrators should try to ensure that disks holding PostgreSQL's WAL log files do not make such false reports. (See Section 30.1.)
After a checkpoint has been made and the log flushed, the
checkpoint's position is saved in the file
pg_control. Therefore, at the start of recovery,
the server first reads pg_control
and
then the checkpoint record; then it performs the REDO operation by
scanning forward from the log location indicated in the checkpoint
record. Because the entire content of data pages is saved in the
log on the first page modification after a checkpoint (assuming
full_page_writes is not disabled), all pages
changed since the checkpoint will be restored to a consistent
state.
To deal with the case where pg_control
is
corrupt, we should support the possibility of scanning existing log
segments in reverse order — newest to oldest — in order to find the
latest checkpoint. This has not been implemented yet.
pg_control
is small enough (less than one disk page)
that it is not subject to partial-write problems, and as of this writing
there have been no reports of database failures due solely to the inability
to read pg_control
itself. So while it is
theoretically a weak spot, pg_control
does not
seem to be a problem in practice.
Logical replication is a method of replicating data objects and their changes, based upon their replication identity (usually a primary key). We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. PostgreSQL supports both mechanisms concurrently, see Chapter 27. Logical replication allows fine-grained control over both data replication and security.
Logical replication uses a publish and subscribe model with one or more subscribers subscribing to one or more publications on a publisher node. Subscribers pull data from the publications they subscribe to and may subsequently re-publish data to allow cascading replication or more complex configurations.
Logical replication of a table typically starts with taking a snapshot of the data on the publisher database and copying that to the subscriber. Once that is done, the changes on the publisher are sent to the subscriber as they occur in real-time. The subscriber applies the data in the same order as the publisher so that transactional consistency is guaranteed for publications within a single subscription. This method of data replication is sometimes referred to as transactional replication.
The typical use-cases for logical replication are:
Sending incremental changes in a single database or a subset of a database to subscribers as they occur.
Firing triggers for individual changes as they arrive on the subscriber.
Consolidating multiple databases into a single one (for example for analytical purposes).
Replicating between different major versions of PostgreSQL.
Replicating between PostgreSQL instances on different platforms (for example Linux to Windows)
Giving access to replicated data to different groups of users.
Sharing a subset of the database between multiple databases.
The subscriber database behaves in the same way as any other PostgreSQL instance and can be used as a publisher for other databases by defining its own publications. When the subscriber is treated as read-only by application, there will be no conflicts from a single subscription. On the other hand, if there are other writes done either by an application or by other subscribers to the same set of tables, conflicts can arise.
A publication can be defined on any physical replication primary. The node where a publication is defined is referred to as publisher. A publication is a set of changes generated from a table or a group of tables, and might also be described as a change set or replication set. Each publication exists in only one database.
Publications are different from schemas and do not affect how the table is
accessed. Each table can be added to multiple publications if needed.
Publications may currently only contain tables. Objects must be added
explicitly, except when a publication is created for ALL TABLES.
Publications can choose to limit the changes they produce to any combination of INSERT, UPDATE, DELETE, and TRUNCATE, similar to how triggers are fired by particular event types. By default, all operation types are replicated.
A published table must have a “replica identity” configured in
order to be able to replicate UPDATE
and DELETE
operations, so that appropriate rows to
update or delete can be identified on the subscriber side. By default,
this is the primary key, if there is one. Another unique index (with
certain additional requirements) can also be set to be the replica
identity. If the table does not have any suitable key, then it can be set
to replica identity “full”, which means the entire row becomes
the key. This, however, is very inefficient and should only be used as a
fallback if no other solution is possible. If a replica identity other
than “full” is set on the publisher side, a replica identity
comprising the same or fewer columns must also be set on the subscriber
side. See REPLICA IDENTITY
for details on
how to set the replica identity. If a table without a replica identity is
added to a publication that replicates UPDATE
or DELETE
operations then
subsequent UPDATE
or DELETE
operations will cause an error on the publisher. INSERT
operations can proceed regardless of any replica identity.
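As an illustration (the table and index names below are hypothetical), the replica identity could be configured like this:

ALTER TABLE departments REPLICA IDENTITY USING INDEX departments_code_key;  -- use a suitable unique index
ALTER TABLE audit_log REPLICA IDENTITY FULL;                                -- last resort: whole row as key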
Every publication can have multiple subscribers.
A publication is created using the CREATE PUBLICATION
command and may later be altered or dropped using corresponding commands.
The individual tables can be added and removed dynamically using ALTER PUBLICATION. Both the ADD TABLE and DROP TABLE operations are
transactional; so the table will start or stop replicating at the correct
snapshot once the transaction has committed.
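For example, a publication's table list might be managed like this (object names are illustrative):

CREATE PUBLICATION mypub FOR TABLE users;
ALTER PUBLICATION mypub ADD TABLE departments;
ALTER PUBLICATION mypub DROP TABLE users;
DROP PUBLICATION mypub;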
A subscription is the downstream side of logical replication. The node where a subscription is defined is referred to as the subscriber. A subscription defines the connection to another database and set of publications (one or more) to which it wants to subscribe.
The subscriber database behaves in the same way as any other PostgreSQL instance and can be used as a publisher for other databases by defining its own publications.
A subscriber node may have multiple subscriptions if desired. It is possible to define multiple subscriptions between a single publisher-subscriber pair, in which case care must be taken to ensure that the subscribed publication objects don't overlap.
Each subscription will receive changes via one replication slot (see Section 27.2.6). Additional replication slots may be required for the initial data synchronization of pre-existing table data and those will be dropped at the end of data synchronization.
A logical replication subscription can be a standby for synchronous
replication (see Section 27.2.8). The standby
name is by default the subscription name. An alternative name can be
specified as application_name
in the connection
information of the subscription.
Subscriptions are dumped by pg_dump
if the current user
is a superuser. Otherwise a warning is written and subscriptions are
skipped, because non-superusers cannot read all subscription information
from the pg_subscription
catalog.
The subscription is added using CREATE SUBSCRIPTION
and
can be stopped/resumed at any time using the
ALTER SUBSCRIPTION
command and removed using
DROP SUBSCRIPTION.
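For example (the connection string and names are illustrative), a subscription might be managed like this:

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pub.example.com dbname=foo user=repuser'
    PUBLICATION mypub;
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub ENABLE;
DROP SUBSCRIPTION mysub;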
When a subscription is dropped and recreated, the synchronization information is lost. This means that the data has to be resynchronized afterwards.
The schema definitions are not replicated, and the published tables must exist on the subscriber. Only regular tables may be the target of replication. For example, you can't replicate to a view.
The tables are matched between the publisher and the subscriber using the fully qualified table name. Replication to differently-named tables on the subscriber is not supported.
Columns of a table are also matched by name. The order of columns in the
subscriber table does not need to match that of the publisher. The data
types of the columns do not need to match, as long as the text
representation of the data can be converted to the target type. For
example, you can replicate from a column of type integer
to a
column of type bigint. The target table can also have
additional columns not provided by the published table. Any such columns
will be filled with the default value as specified in the definition of the
target table.
As mentioned earlier, each (active) subscription receives changes from a replication slot on the remote (publishing) side.
Additional table synchronization slots are normally transient, created
internally to perform initial table synchronization and dropped
automatically when they are no longer needed. These table synchronization
slots have generated names: “pg_%u_sync_%u_%llu” (parameters: Subscription oid, Table relid, system identifier sysid).
Normally, the remote replication slot is created automatically when the
subscription is created using CREATE SUBSCRIPTION
and it
is dropped automatically when the subscription is dropped using
DROP SUBSCRIPTION. In some situations, however, it can
be useful or necessary to manipulate the subscription and the underlying
replication slot separately. Here are some scenarios:
When creating a subscription, the replication slot already exists. In
that case, the subscription can be created using
the create_slot = false
option to associate with the
existing slot.
When creating a subscription, the remote host is not reachable or in an
unclear state. In that case, the subscription can be created using
the connect = false
option. The remote host will then not
be contacted at all. This is what pg_dump
uses. The remote replication slot will then have to be created
manually before the subscription can be activated.
When dropping a subscription, the replication slot should be kept.
This could be useful when the subscriber database is being moved to a
different host and will be activated from there. In that case,
disassociate the slot from the subscription using ALTER
SUBSCRIPTION
before attempting to drop the subscription.
When dropping a subscription, the remote host is not reachable. In
that case, disassociate the slot from the subscription
using ALTER SUBSCRIPTION
before attempting to drop
the subscription. If the remote database instance no longer exists, no
further action is then necessary. If, however, the remote database
instance is just unreachable, the replication slot (and any still
remaining table synchronization slots) should then be
dropped manually; otherwise it/they would continue to reserve WAL and might
eventually cause the disk to fill up. Such cases should be carefully
investigated.
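As a hedged sketch of two of these scenarios (slot, object and host names are hypothetical): associating a subscription with a pre-existing slot, and later dropping the subscription while keeping that slot:

-- use an already-created remote slot instead of creating one:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pub.example.com dbname=foo user=repuser'
    PUBLICATION mypub
    WITH (create_slot = false, slot_name = 'mysub_slot');

-- later, drop the subscription but keep the remote slot:
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (slot_name = NONE);
DROP SUBSCRIPTION mysub;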
Logical replication behaves similarly to normal DML operations in that
the data will be updated even if it was changed locally on the subscriber
node. If incoming data violates any constraints the replication will
stop. This is referred to as a conflict. When
replicating UPDATE
or DELETE
operations, missing data will not produce a conflict and such operations
will simply be skipped.
A conflict will produce an error and will stop the replication; it must be resolved manually by the user. Details about the conflict can be found in the subscriber's server log.
The resolution can be done either by changing data on the subscriber so
that it does not conflict with the incoming change or by skipping the
transaction that conflicts with the existing data. The transaction can be
skipped by calling the
pg_replication_origin_advance()
function with
a node_name
corresponding to the subscription name,
and a position. The current position of origins can be seen in the
pg_replication_origin_status
system view.
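A hedged example of skipping a conflicting transaction on the subscriber (the origin name and LSN below are placeholders; determine the real values from the server log and the view shown before advancing, since advancing too far silently discards changes):

SELECT external_id, remote_lsn FROM pg_replication_origin_status;
SELECT pg_replication_origin_advance('pg_16389', '0/3000158');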
Logical replication currently has the following restrictions or missing functionality. These might be addressed in future releases.
The database schema and DDL commands are not replicated. The initial
schema can be copied by hand using pg_dump --schema-only. Subsequent schema changes would need to be kept
in sync manually. (Note, however, that there is no need for the schemas
to be absolutely the same on both sides.) Logical replication is robust
when schema definitions change in a live database: When the schema is
changed on the publisher and replicated data starts arriving at the
subscriber but does not fit into the table schema, replication will error
until the schema is updated. In many cases, intermittent errors can be
avoided by applying additive schema changes to the subscriber first.
Sequence data is not replicated. The data in serial or identity columns
backed by sequences will of course be replicated as part of the table,
but the sequence itself would still show the start value on the
subscriber. If the subscriber is used as a read-only database, then this
should typically not be a problem. If, however, some kind of switchover
or failover to the subscriber database is intended, then the sequences
would need to be updated to the latest values, either by copying the
current data from the publisher (perhaps
using pg_dump
) or by determining a sufficiently high
value from the tables themselves.
Replication of TRUNCATE
commands is supported, but
some care must be taken when truncating groups of tables connected by
foreign keys. When replicating a truncate action, the subscriber will
truncate the same group of tables that was truncated on the publisher,
either explicitly specified or implicitly collected via
CASCADE, minus tables that are not part of the
subscription. This will work correctly if all affected tables are part
of the same subscription. But if some tables to be truncated on the
subscriber have foreign-key links to tables that are not part of the same
(or any) subscription, then the application of the truncate action on the
subscriber will fail.
Large objects (see Chapter 35) are not replicated. There is no workaround for that, other than storing data in normal tables.
Replication is only supported by tables, including partitioned tables. Attempts to replicate other types of relations, such as views, materialized views, or foreign tables, will result in an error.
When replicating between partitioned tables, the actual replication
originates, by default, from the leaf partitions on the publisher, so
partitions on the publisher must also exist on the subscriber as valid
target tables. (They could either be leaf partitions themselves, or they
could be further subpartitioned, or they could even be independent
tables.) Publications can also specify that changes are to be replicated
using the identity and schema of the partitioned root table instead of
that of the individual leaf partitions in which the changes actually
originate (see CREATE PUBLICATION).
Logical replication starts by copying a snapshot of the data on the publisher database. Once that is done, changes on the publisher are sent to the subscriber as they occur in real time. The subscriber applies data in the order in which commits were made on the publisher so that transactional consistency is guaranteed for the publications within any single subscription.
Logical replication is built with an architecture similar to physical
streaming replication (see Section 27.2.5). It is
implemented by “walsender” and “apply”
processes. The walsender process starts logical decoding (described
in Chapter 49) of the WAL and loads the standard
logical decoding output plugin (pgoutput). The plugin
transforms the changes read
from WAL to the logical replication protocol
(see Section 53.5) and filters the data
according to the publication specification. The data is then continuously
transferred using the streaming replication protocol to the apply worker,
which maps the data to local tables and applies the individual changes as
they are received, in correct transactional order.
The apply process on the subscriber database always runs with
session_replication_role set to replica. This means that, by default,
triggers and rules will not fire on a subscriber. Users can optionally choose to
enable triggers and rules on a table using the
ALTER TABLE
command
and the ENABLE TRIGGER
and ENABLE RULE
clauses.
The logical replication apply process currently only fires row triggers,
not statement triggers. The initial table synchronization, however, is
implemented like a COPY
command and thus fires both row
and statement triggers for INSERT.
The initial data in existing subscribed tables are snapshotted and copied in a parallel instance of a special kind of apply process. This process will create its own replication slot and copy the existing data. As soon as the copy is finished the table contents will become visible to other backends. Once existing data is copied, the worker enters synchronization mode, which ensures that the table is brought up to a synchronized state with the main apply process by streaming any changes that happened during the initial data copy using standard logical replication. During this synchronization phase, the changes are applied and committed in the same order as they happened on the publisher. Once synchronization is done, control of the replication of the table is given back to the main apply process where replication continues as normal.
Because logical replication is based on a similar architecture as physical streaming replication, the monitoring on a publication node is similar to monitoring of a physical replication primary (see Section 27.2.5.2).
The monitoring information about subscriptions is visible in pg_stat_subscription.
This view contains one row for every subscription worker. A subscription
can have zero or more active subscription workers depending on its state.
Normally, there is a single apply process running for an enabled subscription. A disabled subscription or a crashed subscription will have zero rows in this view. If the initial data synchronization of any table is in progress, there will be additional workers for the tables being synchronized.
A user able to modify the schema of subscriber-side tables can execute
arbitrary code as a superuser. Limit ownership
and TRIGGER
privilege on such tables to roles that
superusers trust. Moreover, if untrusted users can create tables, use only
publications that list tables explicitly. That is to say, create a
subscription FOR ALL TABLES
only when superusers trust
every user permitted to create a non-temp table on the publisher or the
subscriber.
The role used for the replication connection must have
the REPLICATION
attribute (or be a superuser). If the
role lacks SUPERUSER and BYPASSRLS,
publisher row security policies can execute. If the role does not trust
all table owners, include options=-crow_security=off
in
the connection string; if a table owner then adds a row security policy,
that setting will cause replication to halt rather than execute the policy.
Access for the role must be configured in pg_hba.conf
and it must have the LOGIN
attribute.
In order to be able to copy the initial table data, the role used for the
replication connection must have the SELECT
privilege on
a published table (or be a superuser).
To create a publication, the user must have the CREATE
privilege in the database.
To add tables to a publication, the user must have ownership rights on the table. To create a publication that publishes all tables automatically, the user must be a superuser.
To create a subscription, the user must be a superuser.
The subscription apply process will run in the local database with the privileges of a superuser.
Privileges are only checked once at the start of a replication connection. They are not re-checked as each change record is read from the publisher, nor are they re-checked for each change when applied.
Logical replication requires several configuration options to be set.
On the publisher side, wal_level must be set to logical, and max_replication_slots
must be set to at least the number of subscriptions expected to connect,
plus some reserve for table synchronization. And
max_wal_senders
should be set to at least the same as
max_replication_slots
plus the number of physical
replicas that are connected at the same time.
max_replication_slots
must also be set on the subscriber.
It should be set to at least the number of subscriptions that will be added
to the subscriber, plus some reserve for table synchronization.
max_logical_replication_workers
must be set to at least
the number of subscriptions, again plus some reserve for the table
synchronization. Additionally, max_worker_processes may need to be adjusted to accommodate the replication workers, at least (max_logical_replication_workers + 1). Note that some extensions and parallel queries also take worker slots from max_worker_processes.
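As a sketch only, a publisher serving five subscriptions and a matching subscriber might be sized roughly like this (all of the numbers below are assumptions to be adapted to the installation):

# publisher
wal_level = logical
max_replication_slots = 10       # 5 subscriptions plus reserve for table synchronization
max_wal_senders = 12             # slots plus concurrently connected physical replicas

# subscriber
max_replication_slots = 10
max_logical_replication_workers = 8
max_worker_processes = 16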
First set the configuration options in postgresql.conf:
wal_level = logical
The other required settings have default values that are sufficient for a basic setup.
pg_hba.conf needs to be adjusted to allow replication (the values here depend on your actual network configuration and the user you want to use for connecting):
host all repuser 0.0.0.0/0 md5
Then on the publisher database:
CREATE PUBLICATION mypub FOR TABLE users, departments;
And on the subscriber database:
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=foo host=bar user=repuser' PUBLICATION mypub;
The above will start the replication process, which synchronizes the
initial table contents of the tables users
and
departments
and then starts replicating
incremental changes to those tables.
This chapter explains what just-in-time compilation is, and how it can be configured in PostgreSQL.
Just-in-Time (JIT) compilation is the process of turning
some form of interpreted program evaluation into a native program, and
doing so at run time.
For example, instead of using general-purpose code that can evaluate
arbitrary SQL expressions to evaluate a particular SQL predicate
like WHERE a.col = 3, it is possible to generate a
function that is specific to that expression and can be natively executed
by the CPU, yielding a speedup.
PostgreSQL has built-in support to perform JIT compilation using LLVM when PostgreSQL is built with --with-llvm.
See src/backend/jit/README
for further details.
Currently PostgreSQL's JIT implementation has support for accelerating expression evaluation and tuple deforming. Several other operations could be accelerated in the future.
Expression evaluation is used to evaluate WHERE
clauses, target lists, aggregates and projections. It can be accelerated
by generating code specific to each case.
Tuple deforming is the process of transforming an on-disk tuple (see Section 70.6.1) into its in-memory representation. It can be accelerated by creating a function specific to the table layout and the number of columns to be extracted.
PostgreSQL is very extensible and allows new data types, functions, operators and other database objects to be defined; see Chapter 38. In fact the built-in objects are implemented using nearly the same mechanisms. This extensibility implies some overhead, for example due to function calls (see Section 38.3). To reduce that overhead, JIT compilation can inline the bodies of small functions into the expressions using them. That allows a significant percentage of the overhead to be optimized away.
LLVM has support for optimizing generated code. Some of the optimizations are cheap enough to be performed whenever JIT is used, while others are only beneficial for longer-running queries. See https://llvm.org/docs/Passes.html#transform-passes for more details about optimizations.
JIT compilation is beneficial primarily for long-running CPU-bound queries. Frequently these will be analytical queries. For short queries the added overhead of performing JIT compilation will often be higher than the time it can save.
To determine whether JIT compilation should be used, the total estimated cost of a query (see Chapter 72 and Section 20.7.2) is used. The estimated cost of the query will be compared with the setting of jit_above_cost. If the cost is higher, JIT compilation will be performed. Two further decisions are then needed. Firstly, if the estimated cost is more than the setting of jit_inline_above_cost, short functions and operators used in the query will be inlined. Secondly, if the estimated cost is more than the setting of jit_optimize_above_cost, expensive optimizations are applied to improve the generated code. Each of these options increases the JIT compilation overhead, but can reduce query execution time considerably.
These cost-based decisions will be made at plan time, not execution time. This means that when prepared statements are in use, and a generic plan is used (see PREPARE), the values of the configuration parameters in effect at prepare time control the decisions, not the settings at execution time.
If jit is set to off, or if no JIT implementation is available (for example because the server was compiled without --with-llvm),
JIT will not be performed, even if it would be
beneficial based on the above criteria. Setting jit
to off
has effects at both plan and execution time.
EXPLAIN can be used to see whether JIT is used or not. As an example, here is a query that is not using JIT:
=# EXPLAIN ANALYZE SELECT SUM(relpages) FROM pg_class;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.27..16.29 rows=1 width=8) (actual time=0.303..0.303 rows=1 loops=1)
   ->  Seq Scan on pg_class  (cost=0.00..15.42 rows=342 width=4) (actual time=0.017..0.111 rows=356 loops=1)
 Planning Time: 0.116 ms
 Execution Time: 0.365 ms
(4 rows)
Given the cost of the plan, it is entirely reasonable that no JIT was used; the cost of JIT would have been bigger than the potential savings. Adjusting the cost limits will lead to JIT use:
=# SET jit_above_cost = 10;
SET
=# EXPLAIN ANALYZE SELECT SUM(relpages) FROM pg_class;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.27..16.29 rows=1 width=8) (actual time=6.049..6.049 rows=1 loops=1)
   ->  Seq Scan on pg_class  (cost=0.00..15.42 rows=342 width=4) (actual time=0.019..0.052 rows=356 loops=1)
 Planning Time: 0.133 ms
 JIT:
   Functions: 3
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.259 ms, Inlining 0.000 ms, Optimization 0.797 ms, Emission 5.048 ms, Total 7.104 ms
 Execution Time: 7.416 ms
As visible here, JIT was used, but inlining and expensive optimization were not. If jit_inline_above_cost or jit_optimize_above_cost were also lowered, that would change.
The configuration variable jit determines whether JIT compilation is enabled or disabled. If it is enabled, the configuration variables jit_above_cost, jit_inline_above_cost, and jit_optimize_above_cost determine whether JIT compilation is performed for a query, and how much effort is spent doing so.
jit_provider determines which JIT implementation is used. It is rarely required to be changed. See Section 32.4.2.
For development and debugging purposes a few additional configuration parameters exist, as described in Section 20.17.
PostgreSQL's JIT
implementation can inline the bodies of functions
of types C and internal, as well as
operators based on such functions. To do so for functions in extensions,
the definitions of those functions need to be made available.
When using PGXS to build an extension
against a server that has been compiled with LLVM JIT support, the
relevant files will be built and installed automatically.
The relevant files have to be installed into
$pkglibdir/bitcode/$extension/
and a summary of them
into $pkglibdir/bitcode/$extension.index.bc, where
$pkglibdir
is the directory returned by
pg_config --pkglibdir
and $extension
is the base name of the extension's shared library.
For functions built into PostgreSQL itself,
the bitcode is installed into
$pkglibdir/bitcode/postgres.
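For instance, the bitcode location can be inspected like this (the output path is hypothetical, and "myext" stands in for an arbitrary extension built with PGXS against an LLVM-enabled server):

pg_config --pkglibdir
/usr/local/pgsql/lib
ls /usr/local/pgsql/lib/bitcode/
myext  myext.index.bc  postgres  postgres.index.bc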
PostgreSQL provides a JIT implementation based on LLVM. The interface to the JIT provider is pluggable and the provider can be changed without recompiling (although currently, the build process only provides inlining support data for LLVM). The active provider is chosen via the setting jit_provider.
A JIT provider is loaded by dynamically loading the
named shared library. The normal library search path is used to locate
the library. To provide the required JIT provider
callbacks and to indicate that the library is actually a
JIT provider, it needs to provide a C function named
_PG_jit_provider_init. This function is passed a
struct that needs to be filled with the callback function pointers for
individual actions:
struct JitProviderCallbacks
{
    JitProviderResetAfterErrorCB reset_after_error;
    JitProviderReleaseContextCB release_context;
    JitProviderCompileExprCB compile_expr;
};

extern void _PG_jit_provider_init(JitProviderCallbacks *cb);
The regression tests are a comprehensive set of tests for the SQL implementation in PostgreSQL. They test standard SQL operations as well as the extended capabilities of PostgreSQL.
The regression tests can be run against an already installed and running server, or using a temporary installation within the build tree. Furthermore, there is a “parallel” and a “sequential” mode for running the tests. The sequential method runs each test script alone, while the parallel method starts up multiple server processes to run groups of tests in parallel. Parallel testing adds confidence that interprocess communication and locking are working correctly.
To run the parallel regression tests after building but before installation, type:
make check
in the top-level directory. (Or you can change to
src/test/regress
and run the command there.)
At the end you should see something like:
=======================
All 193 tests passed.
=======================
or otherwise a note about which tests failed. See Section 33.2 below before assuming that a “failure” represents a serious problem.
Because this test method runs a temporary server, it will not work if you did the build as the root user, since the server will not start as root. Recommended procedure is not to do the build as root, or else to perform testing after completing the installation.
If you have configured PostgreSQL to install
into a location where an older PostgreSQL
installation already exists, and you perform make check
before installing the new version, you might find that the tests fail
because the new programs try to use the already-installed shared
libraries. (Typical symptoms are complaints about undefined symbols.)
If you wish to run the tests before overwriting the old installation,
you'll need to build with configure --disable-rpath.
It is not recommended that you use this option for the final installation,
however.
The parallel regression test starts quite a few processes under your
user ID. Presently, the maximum concurrency is twenty parallel test
scripts, which means forty processes: there's a server process and a
psql process for each test script.
So if your system enforces a per-user limit on the number of processes,
make sure this limit is at least fifty or so, else you might get
random-seeming failures in the parallel test. If you are not in
a position to raise the limit, you can cut down the degree of parallelism
by setting the MAX_CONNECTIONS
parameter. For example:
make MAX_CONNECTIONS=10 check
runs no more than ten tests concurrently.
To run the tests after installation (see Chapter 17), initialize a data directory and start the server as explained in Chapter 19, then type:
make installcheck
or for a parallel test:
make installcheck-parallel
The tests will expect to contact the server at the local host and the
default port number, unless directed otherwise by PGHOST
and
PGPORT
environment variables. The tests will be run in a
database named regression; any existing database by this name
will be dropped.
The tests will also transiently create some cluster-wide objects, such as
roles, tablespaces, and subscriptions. These objects will have names
beginning with regress_. Beware of
using installcheck
mode with an installation that has
any actual global objects named that way.
The make check
and make installcheck
commands
run only the “core” regression tests, which test built-in
functionality of the PostgreSQL server. The source
distribution contains many additional test suites, most of them having
to do with add-on functionality such as optional procedural languages.
To run all test suites applicable to the modules that have been selected to be built, including the core tests, type one of these commands at the top of the build tree:
make check-world
make installcheck-world
These commands run the tests using temporary servers or an
already-installed server, respectively, just as previously explained
for make check
and make installcheck. Other
considerations are the same as previously explained for each method.
Note that make check-world
builds a separate instance
(temporary data directory) for each tested module, so it requires more
time and disk space than make installcheck-world.
On a modern machine with multiple CPU cores and no tight operating-system limits, you can make things go substantially faster with parallelism. The recipe that most PostgreSQL developers actually use for running all tests is something like
make check-world -j8 >/dev/null
with a -j
limit near to or a bit more than the number
of available cores. Discarding stdout
eliminates chatter that's not interesting when you just want to verify
success. (In case of failure, the stderr
messages are usually enough to determine where to look closer.)
Alternatively, you can run individual test suites by typing
make check
or make installcheck
in the appropriate
subdirectory of the build tree. Keep in mind that make
installcheck
assumes you've installed the relevant module(s), not
only the core server.
The additional tests that can be invoked this way include:
Regression tests for optional procedural languages.
These are located under src/pl
.
Regression tests for contrib
modules,
located under contrib
.
Not all contrib
modules have tests.
Regression tests for the interface libraries,
located in src/interfaces/libpq/test
and
src/interfaces/ecpg/test
.
Tests for core-supported authentication methods,
located in src/test/authentication
.
(See below for additional authentication-related tests.)
Tests stressing behavior of concurrent sessions,
located in src/test/isolation
.
Tests for crash recovery and physical replication,
located in src/test/recovery
.
Tests for logical replication,
located in src/test/subscription
.
Tests of client programs, located under src/bin
.
When using installcheck
mode, these tests will create
and destroy test databases whose names
include regression, for example pl_regression or contrib_regression. Beware of
using installcheck
mode with an installation that has
any non-test databases named that way.
Some of these auxiliary test suites use the TAP infrastructure explained
in Section 33.4.
The TAP-based tests are run only when PostgreSQL was configured with the
option --enable-tap-tests. This is recommended for
development, but can be omitted if there is no suitable Perl installation.
Some test suites are not run by default, either because they are not secure
to run on a multiuser system or because they require special software. You
can decide which test suites to run additionally by setting the
make
or environment variable
PG_TEST_EXTRA
to a whitespace-separated list, for
example:
make check-world PG_TEST_EXTRA='kerberos ldap ssl'
The following values are currently supported:
kerberos
Runs the test suite under src/test/kerberos. This requires an MIT Kerberos installation and opens TCP/IP listen sockets.
ldap
Runs the test suite under src/test/ldap. This requires an OpenLDAP installation and opens TCP/IP listen sockets.
ssl
Runs the test suite under src/test/ssl. This opens TCP/IP listen sockets.
Tests for features that are not supported by the current build
configuration are not run even if they are mentioned in
PG_TEST_EXTRA.
In addition, there are tests in src/test/modules
which will be run by make check-world
but not
by make installcheck-world. This is because they
install non-production extensions or have other side-effects that are
considered undesirable for a production installation. You can
use make install
and make
installcheck
in one of those subdirectories if you wish,
but it's not recommended to do so with a non-test server.
By default, tests using a temporary installation use the
locale defined in the current environment and the corresponding
database encoding as determined by initdb. It
can be useful to test different locales by setting the appropriate
environment variables, for example:
make check LANG=C
make check LC_COLLATE=en_US.utf8 LC_CTYPE=fr_CA.utf8
For implementation reasons, setting LC_ALL
does not
work for this purpose; all the other locale-related environment
variables do work.
When testing against an existing installation, the locale is determined by the existing database cluster and cannot be set separately for the test run.
You can also choose the database encoding explicitly by setting
the variable ENCODING, for example:
make check LANG=C ENCODING=EUC_JP
Setting the database encoding this way typically only makes sense if the locale is C; otherwise the encoding is chosen automatically from the locale, and specifying an encoding that does not match the locale will result in an error.
The database encoding can be set for tests against either a temporary or an existing installation, though in the latter case it must be compatible with the installation's locale.
Custom server settings to use when running a regression test suite can be
set in the PGOPTIONS
environment variable (for settings
that allow this):
make check PGOPTIONS="-c force_parallel_mode=regress -c work_mem=50MB"
When running against a temporary installation, custom settings can also be
set by supplying a pre-written postgresql.conf:
echo 'log_checkpoints = on' > test_postgresql.conf
echo 'work_mem = 50MB' >> test_postgresql.conf
make check EXTRA_REGRESS_OPTS="--temp-config=test_postgresql.conf"
This can be useful to enable additional logging, adjust resource limits, or enable extra run-time checks such as debug_discard_caches.
The core regression test suite contains a few test files that are not
run by default, because they might be platform-dependent or take a
very long time to run. You can run these or other extra test
files by setting the variable EXTRA_TESTS. For example, to run the numeric_big test:
make check EXTRA_TESTS=numeric_big
The source distribution also contains regression tests for the static behavior of Hot Standby. These tests require a running primary server and a running standby server that is accepting new WAL changes from the primary (using either file-based log shipping or streaming replication). Those servers are not automatically created for you, nor is replication setup documented here. Please check the various sections of the documentation devoted to the required commands and related issues.
To run the Hot Standby tests, first create a database
called regression
on the primary:
psql -h primary -c "CREATE DATABASE regression"
Next, run the preparatory script
src/test/regress/sql/hs_primary_setup.sql
on the primary in the regression database, for example:
psql -h primary -f src/test/regress/sql/hs_primary_setup.sql regression
Allow these changes to propagate to the standby.
Now arrange for the default database connection to be to the standby
server under test (for example, by setting the PGHOST
and
PGPORT
environment variables).
Finally, run make standbycheck
in the regression directory:
cd src/test/regress
make standbycheck
Some extreme behaviors can also be generated on the primary using the
script src/test/regress/sql/hs_primary_extremes.sql
to allow the behavior of the standby to be tested.
Some properly installed and fully functional
PostgreSQL installations can
“fail” some of these regression tests due to
platform-specific artifacts such as varying floating-point representation
and message wording. The tests are currently evaluated using a simple
diff
comparison against the outputs
generated on a reference system, so the results are sensitive to
small system differences. When a test is reported as
“failed”, always examine the differences between
expected and actual results; you might find that the
differences are not significant. Nonetheless, we still strive to
maintain accurate reference files across all supported platforms,
so it can be expected that all tests pass.
The actual outputs of the regression tests are in files in the
src/test/regress/results
directory. The test
script uses diff
to compare each output
file against the reference outputs stored in the
src/test/regress/expected
directory. Any
differences are saved for your inspection in
src/test/regress/regression.diffs.
(When running a test suite other than the core tests, these files
of course appear in the relevant subdirectory,
not src/test/regress
.)
If you don't
like the diff
options that are used by default, set the
environment variable PG_REGRESS_DIFF_OPTS, for instance PG_REGRESS_DIFF_OPTS='-c'. (Or you can run diff yourself, if you prefer.)
If for some reason a particular platform generates a “failure” for a given test, but inspection of the output convinces you that the result is valid, you can add a new comparison file to silence the failure report in future test runs. See Section 33.3 for details.
Some of the regression tests involve intentional invalid input values. Error messages can come from either the PostgreSQL code or from the host platform system routines. In the latter case, the messages can vary between platforms, but should reflect similar information. These differences in messages will result in a “failed” regression test that can be validated by inspection.
If you run the tests against a server that was initialized with a collation-order locale other than C, then there might be differences due to sort order and subsequent failures. The regression test suite is set up to handle this problem by providing alternate result files that together are known to handle a large number of locales.
To run the tests in a different locale when using the
temporary-installation method, pass the appropriate
locale-related environment variables on
the make
command line, for example:
make check LANG=de_DE.utf8
(The regression test driver unsets LC_ALL, so it
does not work to choose the locale using that variable.) To use
no locale, either unset all locale-related environment variables
(or set them to C) or use the following
special invocation:
make check NO_LOCALE=1
When running the tests against an existing installation, the
locale setup is determined by the existing installation. To
change it, initialize the database cluster with a different
locale by passing the appropriate options
to initdb.
In general, it is advisable to try to run the regression tests in the locale setup that is wanted for production use, as this will exercise the locale- and encoding-related code portions that will actually be used in production. Depending on the operating system environment, you might get failures, but then you will at least know what locale-specific behaviors to expect when running real applications.
Most of the date and time results are dependent on the time zone
environment. The reference files are generated for time zone
PST8PDT
(Berkeley, California), and there will be
apparent failures if the tests are not run with that time zone setting.
The regression test driver sets environment variable
PGTZ
to PST8PDT, which normally
ensures proper results.
Some of the tests involve computing 64-bit floating-point numbers (double precision) from table columns. Differences in
results involving mathematical functions of double
precision
columns have been observed. The float8
and
geometry
tests are particularly prone to small differences
across platforms, or even with different compiler optimization settings.
Human eyeball comparison is needed to determine the real
significance of these differences which are usually 10 places to
the right of the decimal point.
Some systems display minus zero as -0, while others just show 0.
Some systems signal errors from pow()
and
exp()
differently from the mechanism
expected by the current PostgreSQL
code.
You might see differences in which the same rows are output in a different order than what appears in the expected file. In most cases this is not, strictly speaking, a bug. Most of the regression test scripts are not so pedantic as to use an ORDER BY for every single SELECT, and so their result row orderings are not well-defined according to the SQL specification. In practice, since we are looking at the same queries being executed on the same data by the same software, we usually get the same result ordering on all platforms, so the lack of ORDER BY is not a problem. Some queries do exhibit cross-platform ordering differences, however. When testing against an already-installed server, ordering differences can also be caused by non-C locale settings or non-default parameter settings, such as custom values of work_mem or the planner cost parameters.
Therefore, if you see an ordering difference, it's not something to worry about, unless the query does have an ORDER BY that your result is violating. However, please report it anyway, so that we can add an ORDER BY to that particular query to eliminate the bogus “failure” in future releases.
You might wonder why we don't order all the regression test queries explicitly to get rid of this issue once and for all. The reason is that doing so would make the regression tests less useful, not more, since they'd tend to exercise query plan types that produce ordered results to the exclusion of those that don't.
If the errors test results in a server crash at the select infinite_recurse() command, it means that the platform's limit on process stack size is smaller than the max_stack_depth parameter indicates. This can be fixed by running the server under a higher stack size limit (4MB is recommended with the default value of max_stack_depth). If you are unable to do that, an alternative is to reduce the value of max_stack_depth.
On platforms supporting getrlimit(), the server should automatically choose a safe value of max_stack_depth; so unless you've manually overridden this setting, a failure of this kind is a reportable bug.
The random test script is intended to produce random results. In very rare cases, this causes that regression test to fail. Typing:
diff results/random.out expected/random.out
should produce only one or a few lines of differences. You need not worry unless the random test fails repeatedly.
When running the tests against an existing installation, some non-default parameter settings could cause the tests to fail. For example, changing parameters such as enable_seqscan or enable_indexscan could cause plan changes that would affect the results of tests that use EXPLAIN.
Since some of the tests inherently produce environment-dependent results, we have provided ways to specify alternate “expected” result files. Each regression test can have several comparison files showing possible results on different platforms. There are two independent mechanisms for determining which comparison file is used for each test.
The first mechanism allows comparison files to be selected for specific platforms. There is a mapping file, src/test/regress/resultmap, that defines which comparison file to use for each platform. To eliminate bogus test “failures” for a particular platform, you first choose or make a variant result file, and then add a line to the resultmap file.
Each line in the mapping file is of the form
testname:output:platformpattern=comparisonfilename
The test name is just the name of the particular regression test module. The output value indicates which output file to check. For the standard regression tests, this is always out. The value corresponds to the file extension of the output file. The platform pattern is a pattern in the style of the Unix tool expr (that is, a regular expression with an implicit ^ anchor at the start). It is matched against the platform name as printed by config.guess. The comparison file name is the base name of the substitute result comparison file.
For example: some systems lack a working strtof function, for which our workaround causes rounding errors in the float4 regression test. Therefore, we provide a variant comparison file, float4-misrounded-input.out, which includes the results to be expected on these systems. To silence the bogus “failure” message on HP-UX 10 platforms, resultmap includes:
float4:out:hppa.*-hp-hpux10.*=float4-misrounded-input.out
which will trigger on any machine where the output of config.guess matches hppa.*-hp-hpux10.*.
Other lines in resultmap select the variant comparison file for other platforms where it's appropriate.
The second selection mechanism for variant comparison files is much more automatic: it simply uses the “best match” among several supplied comparison files. The regression test driver script considers both the standard comparison file for a test, testname.out, and variant files named testname_digit.out (where the digit is any single digit 0-9). If any such file is an exact match, the test is considered to pass; otherwise, the one that generates the shortest diff is used to create the failure report. (If resultmap includes an entry for the particular test, then the base testname is the substitute name given in resultmap.)
For example, for the char test, the comparison file char.out contains results that are expected in the C and POSIX locales, while the file char_1.out contains results sorted as they appear in many other locales.
The best-match mechanism was devised to cope with locale-dependent results, but it can be used in any situation where the test results cannot be predicted easily from the platform name alone. A limitation of this mechanism is that the test driver cannot tell which variant is actually “correct” for the current environment; it will just pick the variant that seems to work best. Therefore it is safest to use this mechanism only for variant results that you are willing to consider equally valid in all contexts.
Various tests, particularly the client program tests under src/bin, use the Perl TAP tools and are run using the Perl testing program prove. You can pass command-line options to prove by setting the make variable PROVE_FLAGS, for example:
make -C src/bin check PROVE_FLAGS='--timer'
See the manual page of prove for more information.
The make variable PROVE_TESTS can be used to define a whitespace-separated list of paths relative to the Makefile invoking prove to run the specified subset of tests instead of the default t/*.pl. For example:
make check PROVE_TESTS='t/001_test1.pl t/003_test3.pl'
The TAP tests require the Perl module IPC::Run. This module is available from CPAN or an operating system package. They also require PostgreSQL to be configured with the option --enable-tap-tests.
Generically speaking, the TAP tests will test the executables in a previously-installed installation tree if you say make installcheck, or will build a new local installation tree from current sources if you say make check. In either case they will initialize a local instance (data directory) and transiently run a server in it. Some of these tests run more than one server. Thus, these tests can be fairly resource-intensive.
It's important to realize that the TAP tests will start test server(s) even when you say make installcheck; this is unlike the traditional non-TAP testing infrastructure, which expects to use an already-running test server in that case. Some PostgreSQL subdirectories contain both traditional-style and TAP-style tests, meaning that make installcheck will produce a mix of results from temporary servers and the already-running test server.
The PostgreSQL source code can be compiled with coverage testing instrumentation, so that it becomes possible to examine which parts of the code are covered by the regression tests or any other test suite that is run with the code. This is currently supported when compiling with GCC, and it requires the gcov and lcov programs.
A typical workflow looks like this:
./configure --enable-coverage ... OTHER OPTIONS ...
make
make check # or other test suite
make coverage-html
Then point your HTML browser to coverage/index.html.
If you don't have lcov or prefer text output over an HTML report, you can run make coverage instead of make coverage-html, which will produce .gcov output files for each source file relevant to the test. (make coverage and make coverage-html will overwrite each other's files, so mixing them might be confusing.)
You can run several different tests before making the coverage report; the execution counts will accumulate. If you want to reset the execution counts between test runs, run:
make coverage-clean
You can run the make coverage-html or make coverage command in a subdirectory if you want a coverage report for only a portion of the code tree.
Use make distclean to clean up when done.
This part describes the client programming interfaces distributed with PostgreSQL. Each of these chapters can be read independently. Note that there are many other programming interfaces for client programs that are distributed separately and contain their own documentation (Appendix H lists some of the more popular ones). Readers of this part should be familiar with using SQL commands to manipulate and query the database (see Part II) and of course with the programming language that the interface uses.
libpq is the C application programmer's interface to PostgreSQL. libpq is a set of library functions that allow client programs to pass queries to the PostgreSQL backend server and to receive the results of these queries.
libpq is also the underlying engine for several other PostgreSQL application interfaces, including those written for C++, Perl, Python, Tcl and ECPG. So some aspects of libpq's behavior will be important to you if you use one of those packages. In particular, Section 34.15, Section 34.16 and Section 34.19 describe behavior that is visible to the user of any application that uses libpq.
Some short programs are included at the end of this chapter (Section 34.22) to show how to write programs that use libpq. There are also several complete examples of libpq applications in the directory src/test/examples in the source code distribution.
Client programs that use libpq must include the header file libpq-fe.h and must link with the libpq library.
The following functions deal with making a connection to a PostgreSQL backend server. An application program can have several backend connections open at one time. (One reason to do that is to access more than one database.) Each connection is represented by a PGconn object, which is obtained from the function PQconnectdb, PQconnectdbParams, or PQsetdbLogin. Note that these functions will always return a non-null object pointer, unless perhaps there is too little memory even to allocate the PGconn object. The PQstatus function should be called to check the return value for a successful connection before queries are sent via the connection object.
If untrusted users have access to a database that has not adopted a secure schema usage pattern, begin each session by removing publicly-writable schemas from search_path. One can set parameter key word options to value -csearch_path=. Alternately, one can issue PQexec(conn, "SELECT pg_catalog.set_config('search_path', '', false)") after connecting. This consideration is not specific to libpq; it applies to every interface for executing arbitrary SQL commands.
On Unix, forking a process with open libpq connections can lead to unpredictable results because the parent and child processes share the same sockets and operating system resources. For this reason, such usage is not recommended, though doing an exec from the child process to load a new executable is safe.
PQconnectdbParams
Makes a new connection to the database server.
PGconn *PQconnectdbParams(const char * const *keywords, const char * const *values, int expand_dbname);
This function opens a new database connection using the parameters taken from two NULL-terminated arrays. The first, keywords, is defined as an array of strings, each one being a key word. The second, values, gives the value for each key word. Unlike PQsetdbLogin below, the parameter set can be extended without changing the function signature, so use of this function (or its nonblocking analogs PQconnectStartParams and PQconnectPoll) is preferred for new application programming.
The currently recognized parameter key words are listed in Section 34.1.2.
The passed arrays can be empty to use all default parameters, or can contain one or more parameter settings. They must be matched in length. Processing will stop at the first NULL entry in the keywords array. Also, if the values entry associated with a non-NULL keywords entry is NULL or an empty string, that entry is ignored and processing continues with the next pair of array entries.
When expand_dbname is non-zero, the value for the first dbname key word is checked to see if it is a connection string. If so, it is “expanded” into the individual connection parameters extracted from the string. The value is considered to be a connection string, rather than just a database name, if it contains an equal sign (=) or it begins with a URI scheme designator. (More details on connection string formats appear in Section 34.1.1.) Only the first occurrence of dbname is treated in this way; any subsequent dbname parameter is processed as a plain database name.
In general the parameter arrays are processed from start to end. If any key word is repeated, the last value (that is not NULL or empty) is used. This rule applies in particular when a key word found in a connection string conflicts with one appearing in the keywords array. Thus, the programmer may determine whether array entries can override or be overridden by values taken from a connection string. Array entries appearing before an expanded dbname entry can be overridden by fields of the connection string, and in turn those fields are overridden by array entries appearing after dbname (but, again, only if those entries supply non-empty values).
After processing all the array entries and any expanded connection string, any connection parameters that remain unset are filled with default values. If an unset parameter's corresponding environment variable (see Section 34.15) is set, its value is used. If the environment variable is not set either, then the parameter's built-in default value is used.
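As an illustration (not one of the documentation's official examples), the following minimal sketch connects with PQconnectdbParams, checks the result with PQstatus, and releases the connection with PQfinish; the host, database, and application names are placeholders:
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Parallel NULL-terminated keyword/value arrays; all values here are
     * placeholders. expand_dbname = 0 treats dbname as a plain name. */
    const char *const keywords[] = {"host", "dbname", "application_name", NULL};
    const char *const values[]   = {"localhost", "mydb", "demo", NULL};

    PGconn *conn = PQconnectdbParams(keywords, values, 0);

    /* The pointer is non-null unless allocation failed, so check the
     * connection status explicitly before using the connection. */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    printf("connected to database %s\n", PQdb(conn));
    PQfinish(conn);
    return EXIT_SUCCESS;
}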
PQconnectdb
Makes a new connection to the database server.
PGconn *PQconnectdb(const char *conninfo);
This function opens a new database connection using the parameters taken from the string conninfo.
The passed string can be empty to use all default parameters, or it can contain one or more parameter settings separated by whitespace, or it can contain a URI. See Section 34.1.1 for details.
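For comparison, a minimal sketch using PQconnectdb with a placeholder connection string follows the same error-handling pattern as the PQconnectdbParams sketch above:
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* The connection string is a placeholder. */
    PGconn *conn = PQconnectdb("host=localhost dbname=mydb connect_timeout=10");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

    PQfinish(conn);
    return 0;
}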
PQsetdbLogin
Makes a new connection to the database server.
PGconn *PQsetdbLogin(const char *pghost, const char *pgport, const char *pgoptions, const char *pgtty, const char *dbName, const char *login, const char *pwd);
This is the predecessor of PQconnectdb with a fixed set of parameters. It has the same functionality except that the missing parameters will always take on default values. Write NULL or an empty string for any one of the fixed parameters that is to be defaulted.
If the dbName contains an = sign or has a valid connection URI prefix, it is taken as a conninfo string in exactly the same way as if it had been passed to PQconnectdb, and the remaining parameters are then applied as specified for PQconnectdbParams.
pgtty is no longer used and any value passed will be ignored.
PQsetdb
Makes a new connection to the database server.
PGconn *PQsetdb(char *pghost, char *pgport, char *pgoptions, char *pgtty, char *dbName);
This is a macro that calls PQsetdbLogin with null pointers for the login and pwd parameters. It is provided for backward compatibility with very old programs.
PQconnectStartParams
PQconnectStart
PQconnectPoll
Make a connection to the database server in a nonblocking manner.
PGconn *PQconnectStartParams(const char * const *keywords, const char * const *values, int expand_dbname);
PGconn *PQconnectStart(const char *conninfo);
PostgresPollingStatusType PQconnectPoll(PGconn *conn);
These three functions are used to open a connection to a database server such that your application's thread of execution is not blocked on remote I/O whilst doing so. The point of this approach is that the waits for I/O to complete can occur in the application's main loop, rather than down inside PQconnectdbParams or PQconnectdb, and so the application can manage this operation in parallel with other activities.
With PQconnectStartParams, the database connection is made using the parameters taken from the keywords and values arrays, and controlled by expand_dbname, as described above for PQconnectdbParams.
With PQconnectStart, the database connection is made using the parameters taken from the string conninfo as described above for PQconnectdb.
Neither PQconnectStartParams nor PQconnectStart nor PQconnectPoll will block, so long as a number of restrictions are met:
The hostaddr parameter must be used appropriately to prevent DNS queries from being made. See the documentation of this parameter in Section 34.1.2 for details.
If you call PQtrace, ensure that the stream object into which you trace will not block.
You must ensure that the socket is in the appropriate state before calling PQconnectPoll, as described below.
To begin a nonblocking connection request, call PQconnectStart or PQconnectStartParams. If the result is null, then libpq has been unable to allocate a new PGconn structure. Otherwise, a valid PGconn pointer is returned (though not yet representing a valid connection to the database). Next call PQstatus(conn). If the result is CONNECTION_BAD, the connection attempt has already failed, typically because of invalid connection parameters.
If PQconnectStart or PQconnectStartParams succeeds, the next stage is to poll libpq so that it can proceed with the connection sequence.
Use PQsocket(conn) to obtain the descriptor of the socket underlying the database connection. (Caution: do not assume that the socket remains the same across PQconnectPoll calls.)
Loop thus: If PQconnectPoll(conn) last returned PGRES_POLLING_READING, wait until the socket is ready to read (as indicated by select(), poll(), or similar system function). Then call PQconnectPoll(conn) again. Conversely, if PQconnectPoll(conn) last returned PGRES_POLLING_WRITING, wait until the socket is ready to write, then call PQconnectPoll(conn) again. On the first iteration, i.e., if you have yet to call PQconnectPoll, behave as if it last returned PGRES_POLLING_WRITING. Continue this loop until PQconnectPoll(conn) returns PGRES_POLLING_FAILED, indicating the connection procedure has failed, or PGRES_POLLING_OK, indicating the connection has been successfully made.
At any time during connection, the status of the connection can be checked by calling PQstatus. If this call returns CONNECTION_BAD, then the connection procedure has failed; if the call returns CONNECTION_OK, then the connection is ready. Both of these states are equally detectable from the return value of PQconnectPoll, described above. Other states might also occur during (and only during) an asynchronous connection procedure. These indicate the current stage of the connection procedure and might be useful to provide feedback to the user, for example. These statuses are:
CONNECTION_STARTED
Waiting for connection to be made.
CONNECTION_MADE
Connection OK; waiting to send.
CONNECTION_AWAITING_RESPONSE
Waiting for a response from the server.
CONNECTION_AUTH_OK
Received authentication; waiting for backend start-up to finish.
CONNECTION_SSL_STARTUP
Negotiating SSL encryption.
CONNECTION_SETENV
Negotiating environment-driven parameter settings.
CONNECTION_CHECK_WRITABLE
Checking if connection is able to handle write transactions.
CONNECTION_CONSUME
Consuming any remaining response messages on connection.
Note that, although these constants will remain (in order to maintain compatibility), an application should never rely upon these occurring in a particular order, or at all, or on the status always being one of these documented values. An application might do something like this:
switch(PQstatus(conn))
{
    case CONNECTION_STARTED:
        feedback = "Connecting...";
        break;

    case CONNECTION_MADE:
        feedback = "Connected to server...";
        break;
.
.
.
    default:
        feedback = "Connecting...";
}
The connect_timeout connection parameter is ignored when using PQconnectPoll; it is the application's responsibility to decide whether an excessive amount of time has elapsed. Otherwise, PQconnectStart followed by a PQconnectPoll loop is equivalent to PQconnectdb.
Note that when PQconnectStart or PQconnectStartParams returns a non-null pointer, you must call PQfinish when you are finished with it, in order to dispose of the structure and any associated memory blocks. This must be done even if the connection attempt fails or is abandoned.
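The polling loop described above can be sketched as follows; this is only an illustration under the stated restrictions, not a complete client (the connection string, error handling, and timeout policy are left to the application):
#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

/* A minimal sketch of a nonblocking connection attempt. The caller must
 * check PQstatus() on the result and eventually call PQfinish(). */
static PGconn *
connect_nonblocking(const char *conninfo)
{
    PGconn *conn = PQconnectStart(conninfo);

    if (conn == NULL)
        return NULL;                        /* could not allocate PGconn */
    if (PQstatus(conn) == CONNECTION_BAD)
        return conn;                        /* invalid connection parameters */

    /* On the first iteration, behave as if PGRES_POLLING_WRITING was returned. */
    PostgresPollingStatusType poll_status = PGRES_POLLING_WRITING;

    while (poll_status != PGRES_POLLING_OK &&
           poll_status != PGRES_POLLING_FAILED)
    {
        /* Re-fetch the socket each time; it may change between calls. */
        int     sock = PQsocket(conn);
        fd_set  fds;

        FD_ZERO(&fds);
        FD_SET(sock, &fds);

        if (poll_status == PGRES_POLLING_READING)
            select(sock + 1, &fds, NULL, NULL, NULL);   /* wait until readable */
        else
            select(sock + 1, NULL, &fds, NULL, NULL);   /* wait until writable */

        poll_status = PQconnectPoll(conn);
    }

    return conn;
}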
PQconndefaults
Returns the default connection options.
PQconninfoOption *PQconndefaults(void);

typedef struct
{
    char   *keyword;    /* The keyword of the option */
    char   *envvar;     /* Fallback environment variable name */
    char   *compiled;   /* Fallback compiled in default value */
    char   *val;        /* Option's current value, or NULL */
    char   *label;      /* Label for field in connect dialog */
    char   *dispchar;   /* Indicates how to display this field
                           in a connect dialog. Values are:
                           ""   Display entered value as is
                           "*"  Password field - hide value
                           "D"  Debug option - don't show by default */
    int     dispsize;   /* Field size in characters for dialog */
} PQconninfoOption;
Returns a connection options array. This can be used to determine all possible PQconnectdb options and their current default values. The return value points to an array of PQconninfoOption structures, which ends with an entry having a null keyword pointer. The null pointer is returned if memory could not be allocated. Note that the current default values (val fields) will depend on environment variables and other context. A missing or invalid service file will be silently ignored. Callers must treat the connection options data as read-only.
After processing the options array, free it by passing it to PQconninfoFree. If this is not done, a small amount of memory is leaked for each call to PQconndefaults.
PQconninfo
Returns the connection options used by a live connection.
PQconninfoOption *PQconninfo(PGconn *conn);
Returns a connection options array. This can be used to determine all possible PQconnectdb options and the values that were used to connect to the server. The return value points to an array of PQconninfoOption structures, which ends with an entry having a null keyword pointer. All notes above for PQconndefaults also apply to the result of PQconninfo.
PQconninfoParse
Returns parsed connection options from the provided connection string.
PQconninfoOption *PQconninfoParse(const char *conninfo, char **errmsg);
Parses a connection string and returns the resulting options as an array; or returns NULL if there is a problem with the connection string. This function can be used to extract the PQconnectdb options in the provided connection string. The return value points to an array of PQconninfoOption structures, which ends with an entry having a null keyword pointer.
All legal options will be present in the result array, but the PQconninfoOption for any option not present in the connection string will have val set to NULL; default values are not inserted.
If errmsg is not NULL, then *errmsg is set to NULL on success, else to a malloc'd error string explaining the problem. (It is also possible for *errmsg to be set to NULL and the function to return NULL; this indicates an out-of-memory condition.)
After processing the options array, free it by passing it to PQconninfoFree. If this is not done, some memory is leaked for each call to PQconninfoParse. Conversely, if an error occurs and errmsg is not NULL, be sure to free the error string using PQfreemem.
PQfinish
Closes the connection to the server. Also frees memory used by the PGconn object.
void PQfinish(PGconn *conn);
Note that even if the server connection attempt fails (as indicated by PQstatus), the application should call PQfinish to free the memory used by the PGconn object. The PGconn pointer must not be used again after PQfinish has been called.
PQreset
Resets the communication channel to the server.
void PQreset(PGconn *conn);
This function will close the connection to the server and attempt to establish a new connection, using all the same parameters previously used. This might be useful for error recovery if a working connection is lost.
PQresetStart
PQresetPoll
Reset the communication channel to the server, in a nonblocking manner.
int PQresetStart(PGconn *conn);
PostgresPollingStatusType PQresetPoll(PGconn *conn);
These functions will close the connection to the server and attempt to establish a new connection, using all the same parameters previously used. This can be useful for error recovery if a working connection is lost. They differ from PQreset (above) in that they act in a nonblocking manner. These functions suffer from the same restrictions as PQconnectStartParams, PQconnectStart and PQconnectPoll.
To initiate a connection reset, call PQresetStart. If it returns 0, the reset has failed. If it returns 1, poll the reset using PQresetPoll in exactly the same way as you would create the connection using PQconnectPoll.
PQpingParams
PQpingParams reports the status of the server. It accepts connection parameters identical to those of PQconnectdbParams, described above. It is not necessary to supply correct user name, password, or database name values to obtain the server status; however, if incorrect values are provided, the server will log a failed connection attempt.
PGPing PQpingParams(const char * const *keywords, const char * const *values, int expand_dbname);
The function returns one of the following values:
PQPING_OK
The server is running and appears to be accepting connections.
PQPING_REJECT
The server is running but is in a state that disallows connections (startup, shutdown, or crash recovery).
PQPING_NO_RESPONSE
The server could not be contacted. This might indicate that the server is not running, or that there is something wrong with the given connection parameters (for example, wrong port number), or that there is a network connectivity problem (for example, a firewall blocking the connection request).
PQPING_NO_ATTEMPT
No attempt was made to contact the server, because the supplied parameters were obviously incorrect or there was some client-side problem (for example, out of memory).
PQping
PQping reports the status of the server. It accepts connection parameters identical to those of PQconnectdb, described above. It is not necessary to supply correct user name, password, or database name values to obtain the server status; however, if incorrect values are provided, the server will log a failed connection attempt.
PGPing PQping(const char *conninfo);
The return values are the same as for PQpingParams.
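As an illustration, a minimal sketch that waits for a server to start accepting connections; the connection string is a placeholder, and a real program would add a retry limit:
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    const char *conninfo = "host=localhost port=5432";
    PGPing      status;

    while ((status = PQping(conninfo)) != PQPING_OK)
    {
        if (status == PQPING_NO_ATTEMPT)
        {
            fprintf(stderr, "invalid connection parameters\n");
            return 1;
        }
        sleep(1);       /* server rejecting connections or unreachable; retry */
    }

    printf("server is accepting connections\n");
    return 0;
}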
PQsetSSLKeyPassHook_OpenSSL
PQsetSSLKeyPassHook_OpenSSL lets an application override libpq's default handling of encrypted client certificate key files using sslpassword or interactive prompting.
void PQsetSSLKeyPassHook_OpenSSL(PQsslKeyPassHook_OpenSSL_type hook);
The application passes a pointer to a callback function with signature:
int callback_fn(char *buf, int size, PGconn *conn);
which libpq will then call instead of its default PQdefaultSSLKeyPassHook_OpenSSL handler. The callback should determine the password for the key and copy it to result-buffer buf of size size. The string in buf must be null-terminated. The callback must return the length of the password stored in buf excluding the null terminator. On failure, the callback should set buf[0] = '\0' and return 0. See PQdefaultSSLKeyPassHook_OpenSSL in libpq's source code for an example.
If the user specified an explicit key location, its path will be in conn->sslkey when the callback is invoked. This will be empty if the default key path is being used. For keys that are engine specifiers, it is up to engine implementations whether they use the OpenSSL password callback or define their own handling.
The app callback may choose to delegate unhandled cases to PQdefaultSSLKeyPassHook_OpenSSL, or call it first and try something else if it returns 0, or completely override it.
The callback must not escape normal flow control with exceptions, longjmp(...), etc. It must return normally.
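A minimal sketch of such a callback is shown below; the environment variable name is hypothetical, and unhandled cases are delegated to PQdefaultSSLKeyPassHook_OpenSSL as described above:
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/* Hypothetical callback: take the passphrase from MYAPP_SSLKEY_PASSWORD
 * if it is set, otherwise fall back to libpq's default handler. */
static int
my_ssl_key_pass(char *buf, int size, PGconn *conn)
{
    const char *pw = getenv("MYAPP_SSLKEY_PASSWORD");

    if (pw == NULL)
        return PQdefaultSSLKeyPassHook_OpenSSL(buf, size, conn);

    if ((int) strlen(pw) >= size)
    {
        buf[0] = '\0';              /* does not fit: signal failure */
        return 0;
    }
    strcpy(buf, pw);
    return (int) strlen(buf);
}

/* Install the hook once, before connecting:
 *     PQsetSSLKeyPassHook_OpenSSL(my_ssl_key_pass);
 */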
PQgetSSLKeyPassHook_OpenSSL
PQgetSSLKeyPassHook_OpenSSL returns the current client certificate key password hook, or NULL if none has been set.
PQsslKeyPassHook_OpenSSL_type PQgetSSLKeyPassHook_OpenSSL(void);
Several libpq functions parse a user-specified string to obtain connection parameters. There are two accepted formats for these strings: plain keyword/value strings and URIs. URIs generally follow RFC 3986, except that multi-host connection strings are allowed as further described below.
In the keyword/value format, each parameter setting is in the form keyword = value, with space(s) between settings. Spaces around a setting's equal sign are optional. To write an empty value, or a value containing spaces, surround it with single quotes, for example keyword = 'a value'. Single quotes and backslashes within a value must be escaped with a backslash, i.e., \' and \\.
Example:
host=localhost port=5432 dbname=mydb connect_timeout=10
The recognized parameter key words are listed in Section 34.1.2.
The general form for a connection URI is:
postgresql://[userspec@][hostspec][/dbname][?paramspec]
where userspec is:
user[:password]
and hostspec is:
[host][:port][,...]
and paramspec is:
name=value[&...]
The URI scheme designator can be either postgresql:// or postgres://. Each of the remaining URI parts is optional. The following examples illustrate valid URI syntax:
postgresql://
postgresql://localhost
postgresql://localhost:5433
postgresql://localhost/mydb
postgresql://user@localhost
postgresql://user:secret@localhost
postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp
postgresql://host1:123,host2:456/somedb?target_session_attrs=any&application_name=myapp
Values that would normally appear in the hierarchical part of the URI can alternatively be given as named parameters. For example:
postgresql:///mydb?host=localhost&port=5433
All named parameters must match key words listed in Section 34.1.2, except that for compatibility with JDBC connection URIs, instances of ssl=true are translated into sslmode=require.
The connection URI needs to be encoded with percent-encoding if it includes symbols with special meaning in any of its parts. Here is an example where the equal sign (=) is replaced with %3D and the space character with %20:
postgresql://user@localhost:5433/mydb?options=-c%20synchronous_commit%3Doff
The host part may be either a host name or an IP address. To specify an IPv6 address, enclose it in square brackets:
postgresql://[2001:db8::1234]/database
The host part is interpreted as described for the parameter host. In particular, a Unix-domain socket connection is chosen if the host part is either empty or looks like an absolute path name, otherwise a TCP/IP connection is initiated. Note, however, that the slash is a reserved character in the hierarchical part of the URI. So, to specify a non-standard Unix-domain socket directory, either omit the host part of the URI and specify the host as a named parameter, or percent-encode the path in the host part of the URI:
postgresql:///dbname?host=/var/lib/postgresql
postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
It is possible to specify multiple host components, each with an optional port component, in a single URI. A URI of the form postgresql://host1:port1,host2:port2,host3:port3/ is equivalent to a connection string of the form host=host1,host2,host3 port=port1,port2,port3. As further described below, each host will be tried in turn until a connection is successfully established.
It is possible to specify multiple hosts to connect to, so that they are tried in the given order. In the Keyword/Value format, the host, hostaddr, and port options accept comma-separated lists of values. The same number of elements must be given in each option that is specified, such that e.g., the first hostaddr corresponds to the first host name, the second hostaddr corresponds to the second host name, and so forth. As an exception, if only one port is specified, it applies to all the hosts.
In the connection URI format, you can list multiple host:port pairs separated by commas in the host component of the URI.
In either format, a single host name can translate to multiple network addresses. A common example of this is a host that has both an IPv4 and an IPv6 address.
When multiple hosts are specified, or when a single host name is translated to multiple addresses, all the hosts and addresses will be tried in order, until one succeeds. If none of the hosts can be reached, the connection fails. If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.
If a password file is used, you can have different passwords for different hosts. All the other connection options are the same for every host in the list; it is not possible to e.g., specify different usernames for different hosts.
The currently recognized parameter key words are:
host
Name of host to connect to. If a host name looks like an absolute path
name, it specifies Unix-domain communication rather than TCP/IP
communication; the value is the name of the directory in which the
socket file is stored. (On Unix, an absolute path name begins with a
slash. On Windows, paths starting with drive letters are also
recognized.) If the host name starts with @
, it is
taken as a Unix-domain socket in the abstract namespace (currently
supported on Linux and Windows).
The default behavior when host
is not
specified, or is empty, is to connect to a Unix-domain
socket in
/tmp
(or whatever socket directory was specified
when PostgreSQL was built). On Windows and
on machines without Unix-domain sockets, the default is to connect to
localhost
.
A comma-separated list of host names is also accepted, in which case each host name in the list is tried in order; an empty item in the list selects the default behavior as explained above. See Section 34.1.1.3 for details.
hostaddr
Numeric IP address of host to connect to. This should be in the
standard IPv4 address format, e.g., 172.28.40.9
. If
your machine supports IPv6, you can also use those addresses.
TCP/IP communication is
always used when a nonempty string is specified for this parameter.
If this parameter is not specified, the value of host
will be looked up to find the corresponding IP address — or, if
host
specifies an IP address, that value will be
used directly.
Using hostaddr
allows the
application to avoid a host name look-up, which might be important
in applications with time constraints. However, a host name is
required for GSSAPI or SSPI authentication
methods, as well as for verify-full
SSL
certificate verification. The following rules are used:
If host
is specified
without hostaddr
, a host name lookup occurs.
(When using PQconnectPoll
, the lookup occurs
when PQconnectPoll
first considers this host
name, and it may cause PQconnectPoll
to block
for a significant amount of time.)
If hostaddr
is specified without host
,
the value for hostaddr
gives the server network address.
The connection attempt will fail if the authentication
method requires a host name.
If both host
and hostaddr
are specified,
the value for hostaddr
gives the server network address.
The value for host
is ignored unless the
authentication method requires it, in which case it will be
used as the host name.
Note that authentication is likely to fail if host
is not the name of the server at network address hostaddr
.
Also, when both host
and hostaddr
are specified, host
is used to identify the connection in a password file (see
Section 34.16).
A comma-separated list of hostaddr
values is also
accepted, in which case each host in the list is tried in order.
An empty item in the list causes the corresponding host name to be
used, or the default host name if that is empty as well. See
Section 34.1.1.3 for details.
Without either a host name or host address,
libpq will connect using a local
Unix-domain socket; or on Windows and on machines without Unix-domain
sockets, it will attempt to connect to localhost
.
port
Port number to connect to at the server host, or socket file
name extension for Unix-domain
connections.
If multiple hosts were given in the host
or
hostaddr
parameters, this parameter may specify a
comma-separated list of ports of the same length as the host list, or
it may specify a single port number to be used for all hosts.
An empty string, or an empty item in a comma-separated list,
specifies the default port number established
when PostgreSQL was built.
dbname
The database name. Defaults to be the same as the user name. In certain contexts, the value is checked for extended formats; see Section 34.1.1 for more details on those.
user
PostgreSQL user name to connect as. Defaults to be the same as the operating system name of the user running the application.
password
Password to be used if the server demands password authentication.
passfile
Specifies the name of the file used to store passwords
(see Section 34.16).
Defaults to ~/.pgpass
, or
%APPDATA%\postgresql\pgpass.conf
on Microsoft Windows.
(No error is reported if this file does not exist.)
channel_binding
This option controls the client's use of channel binding. A setting
of require
means that the connection must employ
channel binding, prefer
means that the client will
choose channel binding if available, and disable
prevents the use of channel binding. The default
is prefer
if
PostgreSQL is compiled with SSL support;
otherwise the default is disable
.
Channel binding is a method for the server to authenticate itself to
the client. It is only supported over SSL connections
with PostgreSQL 11 or later servers using
the SCRAM
authentication method.
connect_timeout
Maximum time to wait while connecting, in seconds (write as a decimal integer,
e.g., 10
). Zero, negative, or not specified means
wait indefinitely. The minimum allowed timeout is 2 seconds, therefore
a value of 1
is interpreted as 2
.
This timeout applies separately to each host name or IP address.
For example, if you specify two hosts and connect_timeout
is 5, each host will time out if no connection is made within 5
seconds, so the total time spent waiting for a connection might be
up to 10 seconds.
client_encoding
This sets the client_encoding
configuration parameter for this connection. In addition to
the values accepted by the corresponding server option, you
can use auto
to determine the right
encoding from the current locale in the client
(LC_CTYPE
environment variable on Unix
systems).
options
Specifies command-line options to send to the server at connection
start. For example, setting this to -c geqo=off
sets the
session's value of the geqo
parameter to
off
. Spaces within this string are considered to
separate command-line arguments, unless escaped with a backslash
(\
); write \\
to represent a literal
backslash. For a detailed discussion of the available
options, consult Chapter 20.
application_name
Specifies a value for the application_name configuration parameter.
fallback_application_name
Specifies a fallback value for the application_name configuration parameter.
This value will be used if no value has been given for
application_name
via a connection parameter or the
PGAPPNAME
environment variable. Specifying
a fallback name is useful in generic utility programs that
wish to set a default application name but allow it to be
overridden by the user.
keepalives
Controls whether client-side TCP keepalives are used. The default value is 1, meaning on, but you can change this to 0, meaning off, if keepalives are not wanted. This parameter is ignored for connections made via a Unix-domain socket.
keepalives_idle
Controls the number of seconds of inactivity after which TCP should
send a keepalive message to the server. A value of zero uses the
system default. This parameter is ignored for connections made via a
Unix-domain socket, or if keepalives are disabled.
It is only supported on systems where TCP_KEEPIDLE
or
an equivalent socket option is available, and on Windows; on other
systems, it has no effect.
keepalives_interval
Controls the number of seconds after which a TCP keepalive message
that is not acknowledged by the server should be retransmitted. A
value of zero uses the system default. This parameter is ignored for
connections made via a Unix-domain socket, or if keepalives are disabled.
It is only supported on systems where TCP_KEEPINTVL
or
an equivalent socket option is available, and on Windows; on other
systems, it has no effect.
keepalives_count
Controls the number of TCP keepalives that can be lost before the
client's connection to the server is considered dead. A value of
zero uses the system default. This parameter is ignored for
connections made via a Unix-domain socket, or if keepalives are disabled.
It is only supported on systems where TCP_KEEPCNT
or
an equivalent socket option is available; on other systems, it has no
effect.
tcp_user_timeout
Controls the number of milliseconds that transmitted data may
remain unacknowledged before a connection is forcibly closed.
A value of zero uses the system default. This parameter is
ignored for connections made via a Unix-domain socket.
It is only supported on systems where TCP_USER_TIMEOUT
is available; on other systems, it has no effect.
replication
This option determines whether the connection should use the replication protocol instead of the normal protocol. This is what PostgreSQL replication connections as well as tools such as pg_basebackup use internally, but it can also be used by third-party applications. For a description of the replication protocol, consult Section 53.4.
The following values, which are case-insensitive, are supported:
true, on, yes, 1
The connection goes into physical replication mode.
database
The connection goes into logical replication mode, connecting to the database specified in the dbname parameter.
false, off, no, 0
The connection is a regular one, which is the default behavior.
In physical or logical replication mode, only the simple query protocol can be used.
gssencmode
This option determines whether or with what priority a secure GSS TCP/IP connection will be negotiated with the server. There are three modes:
disable
only try a non-GSSAPI-encrypted connection
prefer (default)
if there are GSSAPI credentials present (i.e., in a credentials cache), first try a GSSAPI-encrypted connection; if that fails or there are no credentials, try a non-GSSAPI-encrypted connection. This is the default when PostgreSQL has been compiled with GSSAPI support.
require
only try a GSSAPI-encrypted connection
gssencmode is ignored for Unix domain socket communication. If PostgreSQL is compiled without GSSAPI support, using the require option will cause an error, while prefer will be accepted but libpq will not actually attempt a GSSAPI-encrypted connection.
sslmode
This option determines whether or with what priority a secure SSL TCP/IP connection will be negotiated with the server. There are six modes:
disable
only try a non-SSL connection
allow
first try a non-SSL connection; if that fails, try an SSL connection
prefer (default)
first try an SSL connection; if that fails, try a non-SSL connection
require
only try an SSL connection. If a root CA file is present, verify the certificate in the same way as if verify-ca was specified
verify-ca
only try an SSL connection, and verify that the server certificate is issued by a trusted certificate authority (CA)
verify-full
only try an SSL connection, verify that the server certificate is issued by a trusted CA and that the requested server host name matches that in the certificate
See Section 34.19 for a detailed description of how these options work.
sslmode is ignored for Unix domain socket communication. If PostgreSQL is compiled without SSL support, using options require, verify-ca, or verify-full will cause an error, while options allow and prefer will be accepted but libpq will not actually attempt an SSL connection.
Note that if GSSAPI encryption is possible, that will be used in preference to SSL encryption, regardless of the value of sslmode. To force use of SSL encryption in an environment that has working GSSAPI infrastructure (such as a Kerberos server), also set gssencmode to disable.
requiressl
This option is deprecated in favor of the sslmode
setting.
If set to 1, an SSL connection to the server
is required (this is equivalent to sslmode
require
). libpq will then refuse
to connect if the server does not accept an
SSL connection. If set to 0 (default),
libpq will negotiate the connection type with
the server (equivalent to sslmode
prefer
). This option is only available if
PostgreSQL is compiled with SSL support.
sslcompression
If set to 1, data sent over SSL connections will be compressed. If set to 0, compression will be disabled. The default is 0. This parameter is ignored if a connection without SSL is made.
SSL compression is nowadays considered insecure and its use is no longer recommended. OpenSSL 1.1.0 disables compression by default, and many operating system distributions disable it in prior versions as well, so setting this parameter to on will not have any effect if the server does not accept compression. PostgreSQL 14 disables compression completely in the backend.
If security is not a primary concern, compression can improve throughput if the network is the bottleneck. Disabling compression can improve response time and throughput if CPU performance is the limiting factor.
sslcert
This parameter specifies the file name of the client SSL
certificate, replacing the default
~/.postgresql/postgresql.crt
.
This parameter is ignored if an SSL connection is not made.
sslkey
This parameter specifies the location for the secret key used for
the client certificate. It can either specify a file name that will
be used instead of the default
~/.postgresql/postgresql.key
, or it can specify a key
obtained from an external “engine” (engines are
OpenSSL loadable modules). An external engine
specification should consist of a colon-separated engine name and
an engine-specific key identifier. This parameter is ignored if an
SSL connection is not made.
sslpassword
This parameter specifies the password for the secret key specified in
sslkey
, allowing client certificate private keys
to be stored in encrypted form on disk even when interactive passphrase
input is not practical.
Specifying this parameter with any non-empty value suppresses the
Enter PEM pass phrase:
prompt that OpenSSL will emit by default
when an encrypted client certificate key is provided to
libpq
.
If the key is not encrypted this parameter is ignored. The parameter has no effect on keys specified by OpenSSL engines unless the engine uses the OpenSSL password callback mechanism for prompts.
There is no environment variable equivalent to this option, and no
facility for looking it up in .pgpass
. It can be
used in a service file connection definition. Users with
more sophisticated uses should consider using OpenSSL engines and
tools like PKCS#11 or USB crypto offload devices.
sslrootcert
This parameter specifies the name of a file containing SSL
certificate authority (CA) certificate(s).
If the file exists, the server's certificate will be verified
to be signed by one of these authorities. The default is
~/.postgresql/root.crt
.
sslcrl
This parameter specifies the file name of the SSL server certificate
revocation list (CRL). Certificates listed in this file, if it
exists, will be rejected while attempting to authenticate the
server's certificate. If neither
sslcrl nor
sslcrldir is set, this setting is
taken as
~/.postgresql/root.crl
.
sslcrldir
This parameter specifies the directory name of the SSL server certificate revocation list (CRL). Certificates listed in the files in this directory, if it exists, will be rejected while attempting to authenticate the server's certificate.
The directory needs to be prepared with the
OpenSSL command
openssl rehash
or c_rehash
. See
its documentation for details.
Both sslcrl
and sslcrldir
can be
specified together.
sslsni
If set to 1 (default), libpq sets the TLS extension “Server Name Indication” (SNI) on SSL-enabled connections. By setting this parameter to 0, this is turned off.
The Server Name Indication can be used by SSL-aware proxies to route connections without having to decrypt the SSL stream. (Note that this requires a proxy that is aware of the PostgreSQL protocol handshake, not just any SSL proxy.) However, SNI makes the destination host name appear in cleartext in the network traffic, so it might be undesirable in some cases.
requirepeer
This parameter specifies the operating-system user name of the
server, for example requirepeer=postgres
.
When making a Unix-domain socket connection, if this
parameter is set, the client checks at the beginning of the
connection that the server process is running under the specified
user name; if it is not, the connection is aborted with an error.
This parameter can be used to provide server authentication similar
to that available with SSL certificates on TCP/IP connections.
(Note that if the Unix-domain socket is in
/tmp
or another publicly writable location,
any user could start a server listening there. Use this parameter
to ensure that you are connected to a server run by a trusted user.)
This option is only supported on platforms for which the
peer
authentication method is implemented; see
Section 21.9.
ssl_min_protocol_version
This parameter specifies the minimum SSL/TLS protocol version to allow
for the connection. Valid values are TLSv1
,
TLSv1.1
, TLSv1.2
and
TLSv1.3
. The supported protocols depend on the
version of OpenSSL used, older versions
not supporting the most modern protocol versions. If not specified,
the default is TLSv1.2
, which satisfies industry
best practices as of this writing.
ssl_max_protocol_version
This parameter specifies the maximum SSL/TLS protocol version to allow
for the connection. Valid values are TLSv1
,
TLSv1.1
, TLSv1.2
and
TLSv1.3
. The supported protocols depend on the
version of OpenSSL used, older versions
not supporting the most modern protocol versions. If not set, this
parameter is ignored and the connection will use the maximum bound
defined by the backend, if set. Setting the maximum protocol version
is mainly useful for testing or if some component has issues working
with a newer protocol.
krbsrvname
Kerberos service name to use when authenticating with GSSAPI.
This must match the service name specified in the server
configuration for Kerberos authentication to succeed. (See also
Section 21.6.)
The default value is normally postgres
,
but that can be changed when
building PostgreSQL via
the --with-krb-srvnam
option
of configure.
In most environments, this parameter never needs to be changed.
Some Kerberos implementations might require a different service name,
such as Microsoft Active Directory which requires the service name
to be in upper case (POSTGRES
).
gsslib
GSS library to use for GSSAPI authentication.
Currently this is disregarded except on Windows builds that include
both GSSAPI and SSPI support. In that case, set
this to gssapi
to cause libpq to use the GSSAPI
library for authentication instead of the default SSPI.
service
Service name to use for additional parameters. It specifies a service
name in pg_service.conf
that holds additional connection parameters.
This allows applications to specify only a service name so connection parameters
can be centrally maintained. See Section 34.17.
target_session_attrs
This option determines whether the session must have certain properties to be acceptable. It's typically used in combination with multiple host names to select the first acceptable alternative among several hosts. There are six modes:
any (default)
any successful connection is acceptable
read-write
session must accept read-write transactions by default (that is, the server must not be in hot standby mode and the default_transaction_read_only parameter must be off)
read-only
session must not accept read-write transactions by default (the converse)
primary
server must not be in hot standby mode
standby
server must be in hot standby mode
prefer-standby
first try to find a standby server, but if none of the listed hosts is a standby server, try again in any mode
These functions can be used to interrogate the status of an existing database connection object.
libpq application programmers should be careful to maintain the PGconn abstraction. Use the accessor functions described below to get at the contents of PGconn. Reference to internal PGconn fields using libpq-int.h is not recommended because they are subject to change in the future.
The following functions return parameter values established at connection. These values are fixed for the life of the connection. If a multi-host connection string is used, the values of PQhost, PQport, and PQpass can change if a new connection is established using the same PGconn object. Other values are fixed for the lifetime of the PGconn object.
PQdb
Returns the database name of the connection.
char *PQdb(const PGconn *conn);
PQuser
Returns the user name of the connection.
char *PQuser(const PGconn *conn);
PQpass
Returns the password of the connection.
char *PQpass(const PGconn *conn);
PQpass will return either the password specified in the connection parameters, or if there was none and the password was obtained from the password file, it will return that. In the latter case, if multiple hosts were specified in the connection parameters, it is not possible to rely on the result of PQpass until the connection is established. The status of the connection can be checked using the function PQstatus.
PQhost
Returns the server host name of the active connection. This can be a host name, an IP address, or a directory path if the connection is via Unix socket. (The path case can be distinguished because it will always be an absolute path, beginning with /.)
char *PQhost(const PGconn *conn);
If the connection parameters specified both host and hostaddr, then PQhost will return the host information. If only hostaddr was specified, then that is returned. If multiple hosts were specified in the connection parameters, PQhost returns the host actually connected to.
PQhost returns NULL if the conn argument is NULL. Otherwise, if there is an error producing the host information (perhaps if the connection has not been fully established or there was an error), it returns an empty string.
If multiple hosts were specified in the connection parameters, it is not possible to rely on the result of PQhost until the connection is established. The status of the connection can be checked using the function PQstatus.
PQhostaddr
Returns the server IP address of the active connection. This can be the address that a host name resolved to, or an IP address provided through the hostaddr parameter.
char *PQhostaddr(const PGconn *conn);
PQhostaddr returns NULL if the conn argument is NULL. Otherwise, if there is an error producing the host information (perhaps if the connection has not been fully established or there was an error), it returns an empty string.
PQport
Returns the port of the active connection.
char *PQport(const PGconn *conn);
If multiple ports were specified in the connection parameters, PQport returns the port actually connected to.
PQport returns NULL if the conn argument is NULL. Otherwise, if there is an error producing the port information (perhaps if the connection has not been fully established or there was an error), it returns an empty string.
If multiple ports were specified in the connection parameters, it is not possible to rely on the result of PQport until the connection is established. The status of the connection can be checked using the function PQstatus.
PQtty
This function no longer does anything, but it remains for backwards compatibility. The function always returns an empty string, or NULL if the conn argument is NULL.
char *PQtty(const PGconn *conn);
PQoptions
Returns the command-line options passed in the connection request.
char *PQoptions(const PGconn *conn);
The following functions return status data that can change as operations are executed on the PGconn object.
PQstatus
Returns the status of the connection.
ConnStatusType PQstatus(const PGconn *conn);
The status can be one of a number of values. However, only two of these are seen outside of an asynchronous connection procedure: CONNECTION_OK and CONNECTION_BAD. A good connection to the database has the status CONNECTION_OK. A failed connection attempt is signaled by status CONNECTION_BAD. Ordinarily, an OK status will remain so until PQfinish, but a communications failure might result in the status changing to CONNECTION_BAD prematurely. In that case the application could try to recover by calling PQreset.
See the entry for PQconnectStartParams, PQconnectStart and PQconnectPoll with regards to other status codes that might be returned.
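As a minimal sketch of the recovery pattern described above (the helper function name is illustrative):
#include <stdio.h>
#include <libpq-fe.h>

/* Try to recover a connection that has gone bad, as described above.
 * Returns 1 if the connection is usable afterwards, 0 otherwise.
 * Sketch only: a real application would also restore session state
 * such as prepared statements after a successful PQreset. */
static int ensure_connection(PGconn *conn)
{
    if (PQstatus(conn) == CONNECTION_OK)
        return 1;

    PQreset(conn);                       /* attempt to reconnect */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "could not re-establish connection: %s",
                PQerrorMessage(conn));
        return 0;
    }
    return 1;
}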
PQtransactionStatus
Returns the current in-transaction status of the server.
PGTransactionStatusType PQtransactionStatus(const PGconn *conn);
The status can be PQTRANS_IDLE (currently idle), PQTRANS_ACTIVE (a command is in progress), PQTRANS_INTRANS (idle, in a valid transaction block), or PQTRANS_INERROR (idle, in a failed transaction block). PQTRANS_UNKNOWN is reported if the connection is bad. PQTRANS_ACTIVE is reported only when a query has been sent to the server and not yet completed.
PQparameterStatus
Looks up a current parameter setting of the server.
const char *PQparameterStatus(const PGconn *conn, const char *paramName);
Certain parameter values are reported by the server automatically at connection startup or whenever their values change. PQparameterStatus can be used to interrogate these settings. It returns the current value of a parameter if known, or NULL if the parameter is not known.
Parameters reported as of the current release include server_version, server_encoding, client_encoding, application_name, default_transaction_read_only, in_hot_standby, is_superuser, session_authorization, DateStyle, IntervalStyle, TimeZone, integer_datetimes, and standard_conforming_strings. (server_encoding, TimeZone, and integer_datetimes were not reported by releases before 8.0; standard_conforming_strings was not reported by releases before 8.1; IntervalStyle was not reported by releases before 8.4; application_name was not reported by releases before 9.0; default_transaction_read_only and in_hot_standby were not reported by releases before 14.) Note that server_version, server_encoding and integer_datetimes cannot change after startup.
If no value for standard_conforming_strings is reported, applications can assume it is off, that is, backslashes are treated as escapes in string literals. Also, the presence of this parameter can be taken as an indication that the escape string syntax (E'...') is accepted.
Although the returned pointer is declared const, it in fact points to mutable storage associated with the PGconn structure. It is unwise to assume the pointer will remain valid across queries.
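For example, a client might inspect a couple of the reported settings like this (a sketch only; copy the values if they must survive later operations):
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Inspect a few server-reported settings on an established connection. */
static void show_server_settings(PGconn *conn)
{
    const char *encoding = PQparameterStatus(conn, "server_encoding");
    const char *scs      = PQparameterStatus(conn, "standard_conforming_strings");

    printf("server_encoding: %s\n", encoding ? encoding : "(unknown)");

    /* Per the documentation above, a missing report means "off". */
    if (scs == NULL || strcmp(scs, "off") == 0)
        printf("backslashes are treated as escapes in string literals\n");
}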
PQprotocolVersion
Interrogates the frontend/backend protocol being used.
int PQprotocolVersion(const PGconn *conn);
Applications might wish to use this function to determine whether certain features are supported. Currently, the possible values are 3 (3.0 protocol), or zero (connection bad). The protocol version will not change after connection startup is complete, but it could theoretically change during a connection reset. The 3.0 protocol is supported by PostgreSQL server versions 7.4 and above.
PQserverVersion
Returns an integer representing the server version.
int PQserverVersion(const PGconn *conn);
Applications might use this function to determine the version of the database server they are connected to. The result is formed by multiplying the server's major version number by 10000 and adding the minor version number. For example, version 10.1 will be returned as 100001, and version 11.0 will be returned as 110000. Zero is returned if the connection is bad.
Prior to major version 10, PostgreSQL used three-part version numbers in which the first two parts together represented the major version. For those versions, PQserverVersion uses two digits for each part; for example version 9.1.5 will be returned as 90105, and version 9.2.0 will be returned as 90200.
Therefore, for purposes of determining feature compatibility, applications should divide the result of PQserverVersion by 100 not 10000 to determine a logical major version number. In all release series, only the last two digits differ between minor releases (bug-fix releases).
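The divide-by-100 rule can be applied directly; a brief sketch:
#include <stdio.h>
#include <libpq-fe.h>

/* Report the logical major version of the connected server,
 * following the divide-by-100 rule described above. */
static void check_server_version(PGconn *conn)
{
    int v = PQserverVersion(conn);   /* e.g. 150004 for 15.4, 90624 for 9.6.24 */

    if (v == 0)
    {
        fprintf(stderr, "bad connection\n");
        return;
    }
    printf("logical major version: %d\n", v / 100);   /* 1500 or 906 */
}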
PQerrorMessage
Returns the error message most recently generated by an operation on the connection.
char *PQerrorMessage(const PGconn *conn);
Nearly all libpq functions will set a message for PQerrorMessage if they fail. Note that by libpq convention, a nonempty PQerrorMessage result can consist of multiple lines, and will include a trailing newline. The caller should not free the result directly. It will be freed when the associated PGconn handle is passed to PQfinish. The result string should not be expected to remain the same across operations on the PGconn structure.
PQsocket
Obtains the file descriptor number of the connection socket to the server. A valid descriptor will be greater than or equal to 0; a result of -1 indicates that no server connection is currently open. (This will not change during normal operation, but could change during connection setup or reset.)
int PQsocket(const PGconn *conn);
PQbackendPID
Returns the process ID (PID) of the backend process handling this connection.
int PQbackendPID(const PGconn *conn);
The backend PID is useful for debugging purposes and for comparison to NOTIFY messages (which include the PID of the notifying backend process). Note that the PID belongs to a process executing on the database server host, not the local host!
PQconnectionNeedsPassword
Returns true (1) if the connection authentication method required a password, but none was available. Returns false (0) if not.
int PQconnectionNeedsPassword(const PGconn *conn);
This function can be applied after a failed connection attempt to decide whether to prompt the user for a password.
PQconnectionUsedPassword
Returns true (1) if the connection authentication method used a password. Returns false (0) if not.
int PQconnectionUsedPassword(const PGconn *conn);
This function can be applied after either a failed or successful connection attempt to detect whether the server demanded a password.
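A common pattern is to retry a failed connection with a prompted password only when the server actually asked for one. The following is a rough sketch; the way the password is read and appended to the connection string is illustrative, not prescriptive (a real application should avoid echoing the password and should escape it properly):
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Connect, prompting for a password only if the server demanded one. */
static PGconn *connect_with_prompt(const char *conninfo_base)
{
    PGconn *conn = PQconnectdb(conninfo_base);

    if (PQstatus(conn) == CONNECTION_BAD && PQconnectionNeedsPassword(conn))
    {
        char pw[128];
        char conninfo[1024];

        printf("Password: ");
        if (fgets(pw, sizeof(pw), stdin) != NULL)
        {
            pw[strcspn(pw, "\n")] = '\0';           /* strip trailing newline */
            snprintf(conninfo, sizeof(conninfo),
                     "%s password=%s", conninfo_base, pw);
            PQfinish(conn);
            conn = PQconnectdb(conninfo);
        }
    }
    return conn;
}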
The following functions return information related to SSL. This information usually doesn't change after a connection is established.
PQsslInUse
Returns true (1) if the connection uses SSL, false (0) if not.
int PQsslInUse(const PGconn *conn);
PQsslAttribute
Returns SSL-related information about the connection.
const char *PQsslAttribute(const PGconn *conn, const char *attribute_name);
The list of available attributes varies depending on the SSL library being used, and the type of connection. If an attribute is not available, returns NULL.
The following attributes are commonly available:
library
Name of the SSL implementation in use. (Currently, only "OpenSSL" is implemented.)
protocol
SSL/TLS version in use. Common values are "TLSv1", "TLSv1.1" and "TLSv1.2", but an implementation may return other strings if some other protocol is used.
key_bits
Number of key bits used by the encryption algorithm.
cipher
A short name of the ciphersuite used, e.g., "DHE-RSA-DES-CBC3-SHA". The names are specific to each SSL implementation.
compression
Returns "on" if SSL compression is in use, else it returns "off".
PQsslAttributeNames
Return an array of SSL attribute names available. The array is terminated by a NULL pointer.
const char * const * PQsslAttributeNames(const PGconn *conn);
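These two functions can be combined to list everything the SSL layer reports; a brief sketch:
#include <stdio.h>
#include <libpq-fe.h>

/* Print all SSL attributes reported for this connection. */
static void dump_ssl_attributes(PGconn *conn)
{
    const char *const *names;

    if (!PQsslInUse(conn))
    {
        printf("connection does not use SSL\n");
        return;
    }

    for (names = PQsslAttributeNames(conn); *names != NULL; names++)
    {
        const char *value = PQsslAttribute(conn, *names);
        printf("%s = %s\n", *names, value ? value : "(not available)");
    }
}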
PQsslStruct
Return a pointer to an SSL-implementation-specific object describing the connection.
void *PQsslStruct(const PGconn *conn, const char *struct_name);
The struct(s) available depend on the SSL implementation in use. For OpenSSL, there is one struct, available under the name "OpenSSL", and it returns a pointer to the OpenSSL SSL struct.
To use this function, code along the following lines could be used:
#include <libpq-fe.h>
#include <openssl/ssl.h>

...

    SSL *ssl;

    dbconn = PQconnectdb(...);
    ...

    ssl = PQsslStruct(dbconn, "OpenSSL");
    if (ssl)
    {
        /* use OpenSSL functions to access ssl */
    }
This structure can be used to verify encryption levels, check server certificates, and more. Refer to the OpenSSL documentation for information about this structure.
PQgetssl
Returns the SSL structure used in the connection, or null if SSL is not in use.
void *PQgetssl(const PGconn *conn);
This function is equivalent to PQsslStruct(conn, "OpenSSL")
. It should
not be used in new applications, because the returned struct is
specific to OpenSSL and will not be
available if another SSL implementation is used.
To check if a connection uses SSL, call
PQsslInUse
instead, and for more details about the
connection, use PQsslAttribute
.
Once a connection to a database server has been successfully established, the functions described here are used to perform SQL queries and commands.
PQexec
Submits a command to the server and waits for the result.
PGresult *PQexec(PGconn *conn, const char *command);
Returns a PGresult pointer or possibly a null pointer. A non-null pointer will generally be returned except in out-of-memory conditions or serious errors such as inability to send the command to the server. The PQresultStatus function should be called to check the return value for any errors (including the value of a null pointer, in which case it will return PGRES_FATAL_ERROR). Use PQerrorMessage to get more information about such errors.
The command string can include multiple SQL commands (separated by semicolons). Multiple queries sent in a single PQexec call are processed in a single transaction, unless there are explicit BEGIN/COMMIT commands included in the query string to divide it into multiple transactions. (See Section 53.2.2.1 for more details about how the server handles multi-query strings.) Note however that the returned PGresult structure describes only the result of the last command executed from the string. Should one of the commands fail, processing of the string stops with it and the returned PGresult describes the error condition.
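A minimal sketch of the recommended checking pattern (the table name is illustrative):
#include <stdio.h>
#include <libpq-fe.h>

/* Run a simple query and check its result status, as described above. */
static void run_query(PGconn *conn)
{
    PGresult *res = PQexec(conn, "SELECT count(*) FROM mytable");

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
    else
        printf("row count: %s\n", PQgetvalue(res, 0, 0));

    PQclear(res);   /* always free the result, even on error */
}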
PQexecParams
Submits a command to the server and waits for the result, with the ability to pass parameters separately from the SQL command text.
PGresult *PQexecParams(PGconn *conn, const char *command, int nParams, const Oid *paramTypes, const char * const *paramValues, const int *paramLengths, const int *paramFormats, int resultFormat);
PQexecParams is like PQexec, but offers additional functionality: parameter values can be specified separately from the command string proper, and query results can be requested in either text or binary format.
The function arguments are:
conn
The connection object to send the command through.
command
The SQL command string to be executed. If parameters are used, they are referred to in the command string as $1, $2, etc.
nParams
The number of parameters supplied; it is the length of the arrays paramTypes[], paramValues[], paramLengths[], and paramFormats[]. (The array pointers can be NULL when nParams is zero.)
paramTypes[]
Specifies, by OID, the data types to be assigned to the parameter symbols. If paramTypes is NULL, or any particular element in the array is zero, the server infers a data type for the parameter symbol in the same way it would do for an untyped literal string.
paramValues[]
Specifies the actual values of the parameters. A null pointer in this array means the corresponding parameter is null; otherwise the pointer points to a zero-terminated text string (for text format) or binary data in the format expected by the server (for binary format).
paramLengths[]
Specifies the actual data lengths of binary-format parameters. It is ignored for null parameters and text-format parameters. The array pointer can be null when there are no binary parameters.
paramFormats[]
Specifies whether parameters are text (put a zero in the array entry for the corresponding parameter) or binary (put a one in the array entry for the corresponding parameter). If the array pointer is null then all parameters are presumed to be text strings.
Values passed in binary format require knowledge of the internal representation expected by the backend. For example, integers must be passed in network byte order. Passing numeric values requires knowledge of the server storage format, as implemented in src/backend/utils/adt/numeric.c::numeric_send() and src/backend/utils/adt/numeric.c::numeric_recv().
resultFormat
Specify zero to obtain results in text format, or one to obtain results in binary format. (There is not currently a provision to obtain different result columns in different formats, although that is possible in the underlying protocol.)
The primary advantage of PQexecParams over PQexec is that parameter values can be separated from the command string, thus avoiding the need for tedious and error-prone quoting and escaping.
Unlike PQexec, PQexecParams allows at most one SQL command in the given string. (There can be semicolons in it, but not more than one nonempty command.) This is a limitation of the underlying protocol, but has some usefulness as an extra defense against SQL-injection attacks.
Specifying parameter types via OIDs is tedious, particularly if you prefer not to hard-wire particular OID values into your program. However, you can avoid doing so even in cases where the server by itself cannot determine the type of the parameter, or chooses a different type than you want. In the SQL command text, attach an explicit cast to the parameter symbol to show what data type you will send. For example:
SELECT * FROM mytable WHERE x = $1::bigint;
This forces parameter $1 to be treated as bigint, whereas by default it would be assigned the same type as x. Forcing the parameter type decision, either this way or by specifying a numeric type OID, is strongly recommended when sending parameter values in binary format, because binary format has less redundancy than text format and so there is less chance that the server will detect a type mismatch mistake for you.
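Putting this together, a minimal sketch of PQexecParams with a single text-format parameter (the table name is illustrative):
#include <stdio.h>
#include <libpq-fe.h>

/* Fetch rows matching a caller-supplied value, passing it as a
 * separate text-format parameter rather than interpolating it. */
static void fetch_by_id(PGconn *conn, const char *id_text)
{
    const char *paramValues[1] = { id_text };   /* e.g. "42" */
    PGresult   *res;

    res = PQexecParams(conn,
                       "SELECT * FROM mytable WHERE x = $1::bigint",
                       1,            /* one parameter */
                       NULL,         /* let the cast determine the type */
                       paramValues,
                       NULL,         /* text parameters need no lengths */
                       NULL,         /* all parameters are text */
                       0);           /* ask for text-format results */

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn));

    PQclear(res);
}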
PQprepare
Submits a request to create a prepared statement with the given parameters, and waits for completion.
PGresult *PQprepare(PGconn *conn, const char *stmtName, const char *query, int nParams, const Oid *paramTypes);
PQprepare creates a prepared statement for later execution with PQexecPrepared. This feature allows commands to be executed repeatedly without being parsed and planned each time; see PREPARE for details.
The function creates a prepared statement named stmtName from the query string, which must contain a single SQL command. stmtName can be "" to create an unnamed statement, in which case any pre-existing unnamed statement is automatically replaced; otherwise it is an error if the statement name is already defined in the current session. If any parameters are used, they are referred to in the query as $1, $2, etc.
nParams is the number of parameters for which types are pre-specified in the array paramTypes[]. (The array pointer can be NULL when nParams is zero.) paramTypes[] specifies, by OID, the data types to be assigned to the parameter symbols. If paramTypes is NULL, or any particular element in the array is zero, the server assigns a data type to the parameter symbol in the same way it would do for an untyped literal string. Also, the query can use parameter symbols with numbers higher than nParams; data types will be inferred for these symbols as well. (See PQdescribePrepared for a means to find out what data types were inferred.)
As with PQexec, the result is normally a PGresult object whose contents indicate server-side success or failure. A null result indicates out-of-memory or inability to send the command at all. Use PQerrorMessage to get more information about such errors.
Prepared statements for use with PQexecPrepared can also be created by executing SQL PREPARE statements. Also, although there is no libpq function for deleting a prepared statement, the SQL DEALLOCATE statement can be used for that purpose.
PQexecPrepared
Sends a request to execute a prepared statement with given parameters, and waits for the result.
PGresult *PQexecPrepared(PGconn *conn, const char *stmtName, int nParams, const char * const *paramValues, const int *paramLengths, const int *paramFormats, int resultFormat);
PQexecPrepared is like PQexecParams, but the command to be executed is specified by naming a previously-prepared statement, instead of giving a query string. This feature allows commands that will be used repeatedly to be parsed and planned just once, rather than each time they are executed. The statement must have been prepared previously in the current session.
The parameters are identical to PQexecParams, except that the name of a prepared statement is given instead of a query string, and the paramTypes[] parameter is not present (it is not needed since the prepared statement's parameter types were determined when it was created).
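A brief sketch combining PQprepare and PQexecPrepared (the statement and table names are illustrative):
#include <stdio.h>
#include <libpq-fe.h>

/* Prepare a statement once and execute it repeatedly with
 * different parameter values. */
static void insert_many(PGconn *conn, const char *const *values, int n)
{
    PGresult *res;
    int       i;

    res = PQprepare(conn, "insert_item",
                    "INSERT INTO items(name) VALUES ($1)", 1, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "PREPARE failed: %s", PQerrorMessage(conn));
    PQclear(res);

    for (i = 0; i < n; i++)
    {
        const char *paramValues[1] = { values[i] };

        res = PQexecPrepared(conn, "insert_item", 1,
                             paramValues, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
}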
PQdescribePrepared
Submits a request to obtain information about the specified prepared statement, and waits for completion.
PGresult *PQdescribePrepared(PGconn *conn, const char *stmtName);
PQdescribePrepared allows an application to obtain information about a previously prepared statement. stmtName can be "" or NULL to reference the unnamed statement, otherwise it must be the name of an existing prepared statement. On success, a PGresult with status PGRES_COMMAND_OK is returned. The functions PQnparams and PQparamtype can be applied to this PGresult to obtain information about the parameters of the prepared statement, and the functions PQnfields, PQfname, PQftype, etc provide information about the result columns (if any) of the statement.
PQdescribePortal
Submits a request to obtain information about the specified portal, and waits for completion.
PGresult *PQdescribePortal(PGconn *conn, const char *portalName);
PQdescribePortal allows an application to obtain information about a previously created portal. (libpq does not provide any direct access to portals, but you can use this function to inspect the properties of a cursor created with a DECLARE CURSOR SQL command.)
portalName can be "" or NULL to reference the unnamed portal, otherwise it must be the name of an existing portal. On success, a PGresult with status PGRES_COMMAND_OK is returned. The functions PQnfields, PQfname, PQftype, etc can be applied to the PGresult to obtain information about the result columns (if any) of the portal.
The PGresult structure encapsulates the result returned by the server.
libpq application programmers should be careful to maintain the PGresult abstraction. Use the accessor functions below to get at the contents of PGresult. Avoid directly referencing the fields of the PGresult structure because they are subject to change in the future.
PQresultStatus
Returns the result status of the command.
ExecStatusType PQresultStatus(const PGresult *res);
PQresultStatus can return one of the following values:
PGRES_EMPTY_QUERY
The string sent to the server was empty.
PGRES_COMMAND_OK
Successful completion of a command returning no data.
PGRES_TUPLES_OK
Successful completion of a command returning data (such as a SELECT or SHOW).
PGRES_COPY_OUT
Copy Out (from server) data transfer started.
PGRES_COPY_IN
Copy In (to server) data transfer started.
PGRES_BAD_RESPONSE
The server's response was not understood.
PGRES_NONFATAL_ERROR
A nonfatal error (a notice or warning) occurred.
PGRES_FATAL_ERROR
A fatal error occurred.
PGRES_COPY_BOTH
Copy In/Out (to and from server) data transfer started. This feature is currently used only for streaming replication, so this status should not occur in ordinary applications.
PGRES_SINGLE_TUPLE
The PGresult contains a single result tuple from the current command. This status occurs only when single-row mode has been selected for the query (see Section 34.6).
PGRES_PIPELINE_SYNC
The PGresult represents a synchronization point in pipeline mode, requested by PQpipelineSync. This status occurs only when pipeline mode has been selected.
PGRES_PIPELINE_ABORTED
The PGresult represents a pipeline that has received an error from the server. PQgetResult must be called repeatedly, and each time it will return this status code until the end of the current pipeline, at which point it will return PGRES_PIPELINE_SYNC and normal processing can resume.
If the result status is PGRES_TUPLES_OK or PGRES_SINGLE_TUPLE, then the functions described below can be used to retrieve the rows returned by the query. Note that a SELECT command that happens to retrieve zero rows still shows PGRES_TUPLES_OK.
PGRES_COMMAND_OK is for commands that can never return rows (INSERT or UPDATE without a RETURNING clause, etc.). A response of PGRES_EMPTY_QUERY might indicate a bug in the client software.
A result of status PGRES_NONFATAL_ERROR will never be returned directly by PQexec or other query execution functions; results of this kind are instead passed to the notice processor (see Section 34.13).
PQresStatus
Converts the enumerated type returned by PQresultStatus into a string constant describing the status code. The caller should not free the result.
char *PQresStatus(ExecStatusType status);
PQresultErrorMessage
Returns the error message associated with the command, or an empty string if there was no error.
char *PQresultErrorMessage(const PGresult *res);
If there was an error, the returned string will include a trailing newline. The caller should not free the result directly. It will be freed when the associated PGresult handle is passed to PQclear.
Immediately following a PQexec or PQgetResult call, PQerrorMessage (on the connection) will return the same string as PQresultErrorMessage (on the result). However, a PGresult will retain its error message until destroyed, whereas the connection's error message will change when subsequent operations are done. Use PQresultErrorMessage when you want to know the status associated with a particular PGresult; use PQerrorMessage when you want to know the status from the latest operation on the connection.
PQresultVerboseErrorMessage
Returns a reformatted version of the error message associated with a PGresult object.
char *PQresultVerboseErrorMessage(const PGresult *res, PGVerbosity verbosity, PGContextVisibility show_context);
In some situations a client might wish to obtain a more detailed version of a previously-reported error. PQresultVerboseErrorMessage addresses this need by computing the message that would have been produced by PQresultErrorMessage if the specified verbosity settings had been in effect for the connection when the given PGresult was generated. If the PGresult is not an error result, “PGresult is not an error result” is reported instead. The returned string includes a trailing newline.
Unlike most other functions for extracting data from a PGresult, the result of this function is a freshly allocated string. The caller must free it using PQfreemem() when the string is no longer needed. A NULL return is possible if there is insufficient memory.
PQresultErrorField
Returns an individual field of an error report.
char *PQresultErrorField(const PGresult *res, int fieldcode);
fieldcode is an error field identifier; see the symbols listed below. NULL is returned if the PGresult is not an error or warning result, or does not include the specified field. Field values will normally not include a trailing newline. The caller should not free the result directly. It will be freed when the associated PGresult handle is passed to PQclear.
The following field codes are available:
PG_DIAG_SEVERITY
The severity; the field contents are ERROR, FATAL, or PANIC (in an error message), or WARNING, NOTICE, DEBUG, INFO, or LOG (in a notice message), or a localized translation of one of these. Always present.
PG_DIAG_SEVERITY_NONLOCALIZED
The severity; the field contents are ERROR, FATAL, or PANIC (in an error message), or WARNING, NOTICE, DEBUG, INFO, or LOG (in a notice message). This is identical to the PG_DIAG_SEVERITY field except that the contents are never localized. This is present only in reports generated by PostgreSQL versions 9.6 and later.
PG_DIAG_SQLSTATE
The SQLSTATE code for the error. The SQLSTATE code identifies the type of error that has occurred; it can be used by front-end applications to perform specific operations (such as error handling) in response to a particular database error. For a list of the possible SQLSTATE codes, see Appendix A. This field is not localizable, and is always present.
PG_DIAG_MESSAGE_PRIMARY
The primary human-readable error message (typically one line). Always present.
PG_DIAG_MESSAGE_DETAIL
Detail: an optional secondary error message carrying more detail about the problem. Might run to multiple lines.
PG_DIAG_MESSAGE_HINT
Hint: an optional suggestion what to do about the problem. This is intended to differ from detail in that it offers advice (potentially inappropriate) rather than hard facts. Might run to multiple lines.
PG_DIAG_STATEMENT_POSITION
A string containing a decimal integer indicating an error cursor position as an index into the original statement string. The first character has index 1, and positions are measured in characters not bytes.
PG_DIAG_INTERNAL_POSITION
This is defined the same as the PG_DIAG_STATEMENT_POSITION field, but it is used when the cursor position refers to an internally generated command rather than the one submitted by the client. The PG_DIAG_INTERNAL_QUERY field will always appear when this field appears.
PG_DIAG_INTERNAL_QUERY
The text of a failed internally-generated command. This could be, for example, an SQL query issued by a PL/pgSQL function.
PG_DIAG_CONTEXT
An indication of the context in which the error occurred. Presently this includes a call stack traceback of active procedural language functions and internally-generated queries. The trace is one entry per line, most recent first.
PG_DIAG_SCHEMA_NAME
If the error was associated with a specific database object, the name of the schema containing that object, if any.
PG_DIAG_TABLE_NAME
If the error was associated with a specific table, the name of the table. (Refer to the schema name field for the name of the table's schema.)
PG_DIAG_COLUMN_NAME
If the error was associated with a specific table column, the name of the column. (Refer to the schema and table name fields to identify the table.)
PG_DIAG_DATATYPE_NAME
If the error was associated with a specific data type, the name of the data type. (Refer to the schema name field for the name of the data type's schema.)
PG_DIAG_CONSTRAINT_NAME
If the error was associated with a specific constraint, the name of the constraint. Refer to fields listed above for the associated table or domain. (For this purpose, indexes are treated as constraints, even if they weren't created with constraint syntax.)
PG_DIAG_SOURCE_FILE
The file name of the source-code location where the error was reported.
PG_DIAG_SOURCE_LINE
The line number of the source-code location where the error was reported.
PG_DIAG_SOURCE_FUNCTION
The name of the source-code function reporting the error.
The fields for schema name, table name, column name, data type name, and constraint name are supplied only for a limited number of error types; see Appendix A. Do not assume that the presence of any of these fields guarantees the presence of another field. Core error sources observe the interrelationships noted above, but user-defined functions may use these fields in other ways. In the same vein, do not assume that these fields denote contemporary objects in the current database.
The client is responsible for formatting displayed information to meet its needs; in particular it should break long lines as needed. Newline characters appearing in the error message fields should be treated as paragraph breaks, not line breaks.
Errors generated internally by libpq will have severity and primary message, but typically no other fields.
Note that error fields are only available from PGresult objects, not PGconn objects; there is no PQerrorField function.
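As a small sketch, error fields are typically consulted like this after a failed command (the helper name is illustrative):
#include <stdio.h>
#include <libpq-fe.h>

/* Report the SQLSTATE and primary message from a failed result. */
static void report_error(const PGresult *res)
{
    const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
    const char *primary  = PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY);
    const char *detail   = PQresultErrorField(res, PG_DIAG_MESSAGE_DETAIL);

    fprintf(stderr, "SQLSTATE %s: %s\n",
            sqlstate ? sqlstate : "?????",
            primary ? primary : "(no message)");
    if (detail)
        fprintf(stderr, "DETAIL: %s\n", detail);
}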
PQclear
Frees the storage associated with a PGresult. Every command result should be freed via PQclear when it is no longer needed.
void PQclear(PGresult *res);
You can keep a PGresult object around for as long as you need it; it does not go away when you issue a new command, nor even if you close the connection. To get rid of it, you must call PQclear. Failure to do this will result in memory leaks in your application.
These functions are used to extract information from a PGresult object that represents a successful query result (that is, one that has status PGRES_TUPLES_OK or PGRES_SINGLE_TUPLE). They can also be used to extract information from a successful Describe operation: a Describe's result has all the same column information that actual execution of the query would provide, but it has zero rows. For objects with other status values, these functions will act as though the result has zero rows and zero columns.
PQntuples
Returns the number of rows (tuples) in the query result. (Note that PGresult objects are limited to no more than INT_MAX rows, so an int result is sufficient.)
int PQntuples(const PGresult *res);
PQnfields
Returns the number of columns (fields) in each row of the query result.
int PQnfields(const PGresult *res);
PQfname
Returns the column name associated with the given column number.
Column numbers start at 0. The caller should not free the result
directly. It will be freed when the associated
PGresult
handle is passed to
PQclear
.
char *PQfname(const PGresult *res, int column_number);
NULL
is returned if the column number is out of range.
PQfnumber
Returns the column number associated with the given column name.
int PQfnumber(const PGresult *res, const char *column_name);
-1 is returned if the given name does not match any column.
The given name is treated like an identifier in an SQL command, that is, it is downcased unless double-quoted. For example, given a query result generated from the SQL command:
SELECT 1 AS FOO, 2 AS "BAR";
we would have the results:
PQfname(res, 0)              foo
PQfname(res, 1)              BAR
PQfnumber(res, "FOO")        0
PQfnumber(res, "foo")        0
PQfnumber(res, "BAR")        -1
PQfnumber(res, "\"BAR\"")    1
PQftable
Returns the OID of the table from which the given column was fetched. Column numbers start at 0.
Oid PQftable(const PGresult *res, int column_number);
InvalidOid is returned if the column number is out of range, or if the specified column is not a simple reference to a table column. You can query the system table pg_class to determine exactly which table is referenced.
The type Oid and the constant InvalidOid will be defined when you include the libpq header file. They will both be some integer type.
PQftablecol
Returns the column number (within its table) of the column making up the specified query result column. Query-result column numbers start at 0, but table columns have nonzero numbers.
int PQftablecol(const PGresult *res, int column_number);
Zero is returned if the column number is out of range, or if the specified column is not a simple reference to a table column.
PQfformat
Returns the format code indicating the format of the given column. Column numbers start at 0.
int PQfformat(const PGresult *res, int column_number);
Format code zero indicates textual data representation, while format code one indicates binary representation. (Other codes are reserved for future definition.)
PQftype
Returns the data type associated with the given column number. The integer returned is the internal OID number of the type. Column numbers start at 0.
Oid PQftype(const PGresult *res, int column_number);
You can query the system table pg_type to obtain the names and properties of the various data types. The OIDs of the built-in data types are defined in the file catalog/pg_type_d.h in the PostgreSQL installation's include directory.
PQfmod
Returns the type modifier of the column associated with the given column number. Column numbers start at 0.
int PQfmod(const PGresult *res, int column_number);
The interpretation of modifier values is type-specific; they typically indicate precision or size limits. The value -1 is used to indicate “no information available”. Most data types do not use modifiers, in which case the value is always -1.
PQfsize
Returns the size in bytes of the column associated with the given column number. Column numbers start at 0.
int PQfsize(const PGresult *res, int column_number);
PQfsize returns the space allocated for this column in a database row, in other words the size of the server's internal representation of the data type. (Accordingly, it is not really very useful to clients.) A negative value indicates the data type is variable-length.
PQbinaryTuples
Returns 1 if the PGresult contains binary data and 0 if it contains text data.
int PQbinaryTuples(const PGresult *res);
This function is deprecated (except for its use in connection with COPY), because it is possible for a single PGresult to contain text data in some columns and binary data in others. PQfformat is preferred. PQbinaryTuples returns 1 only if all columns of the result are binary (format 1).
PQgetvalue
Returns a single field value of one row of a PGresult. Row and column numbers start at 0. The caller should not free the result directly. It will be freed when the associated PGresult handle is passed to PQclear.
char *PQgetvalue(const PGresult *res, int row_number, int column_number);
For data in text format, the value returned by PQgetvalue is a null-terminated character string representation of the field value. For data in binary format, the value is in the binary representation determined by the data type's typsend and typreceive functions. (The value is actually followed by a zero byte in this case too, but that is not ordinarily useful, since the value is likely to contain embedded nulls.)
An empty string is returned if the field value is null. See PQgetisnull to distinguish null values from empty-string values.
The pointer returned by PQgetvalue points to storage that is part of the PGresult structure. One should not modify the data it points to, and one must explicitly copy the data into other storage if it is to be used past the lifetime of the PGresult structure itself.
PQgetisnull
Tests a field for a null value. Row and column numbers start at 0.
int PQgetisnull(const PGresult *res, int row_number, int column_number);
This function returns 1 if the field is null and 0 if it contains a non-null value. (Note that PQgetvalue will return an empty string, not a null pointer, for a null field.)
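A typical retrieval loop over a text-format result, using the functions above, might look like this sketch:
#include <stdio.h>
#include <libpq-fe.h>

/* Print every field of a text-format result, marking NULLs explicitly. */
static void print_result(const PGresult *res)
{
    int nrows = PQntuples(res);
    int ncols = PQnfields(res);
    int r, c;

    /* header line with the column names */
    for (c = 0; c < ncols; c++)
        printf("%s%s", PQfname(res, c), c + 1 < ncols ? "\t" : "\n");

    /* one line per row */
    for (r = 0; r < nrows; r++)
        for (c = 0; c < ncols; c++)
            printf("%s%s",
                   PQgetisnull(res, r, c) ? "(null)" : PQgetvalue(res, r, c),
                   c + 1 < ncols ? "\t" : "\n");
}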
PQgetlength
Returns the actual length of a field value in bytes. Row and column numbers start at 0.
int PQgetlength(const PGresult *res, int row_number, int column_number);
This is the actual data length for the particular data value, that is, the size of the object pointed to by PQgetvalue. For text data format this is the same as strlen(). For binary format this is essential information. Note that one should not rely on PQfsize to obtain the actual data length.
PQnparams
Returns the number of parameters of a prepared statement.
int PQnparams(const PGresult *res);
This function is only useful when inspecting the result of PQdescribePrepared. For other types of queries it will return zero.
PQparamtype
Returns the data type of the indicated statement parameter. Parameter numbers start at 0.
Oid PQparamtype(const PGresult *res, int param_number);
This function is only useful when inspecting the result of PQdescribePrepared. For other types of queries it will return zero.
PQprint
Prints out all the rows and, optionally, the column names to the specified output stream.
void PQprint(FILE *fout,              /* output stream */
             const PGresult *res,
             const PQprintOpt *po);

typedef struct
{
    pqbool  header;      /* print output field headings and row count */
    pqbool  align;       /* fill align the fields */
    pqbool  standard;    /* old brain dead format */
    pqbool  html3;       /* output HTML tables */
    pqbool  expanded;    /* expand tables */
    pqbool  pager;       /* use pager for output if needed */
    char    *fieldSep;   /* field separator */
    char    *tableOpt;   /* attributes for HTML table element */
    char    *caption;    /* HTML table caption */
    char    **fieldName; /* null-terminated array of replacement field names */
} PQprintOpt;
This function was formerly used by psql to print query results, but this is no longer the case. Note that it assumes all the data is in text format.
These functions are used to extract other information from PGresult objects.
PQcmdStatus
Returns the command status tag from the SQL command that generated the PGresult.
char *PQcmdStatus(PGresult *res);
Commonly this is just the name of the command, but it might include additional data such as the number of rows processed. The caller should not free the result directly. It will be freed when the associated PGresult handle is passed to PQclear.
PQcmdTuples
Returns the number of rows affected by the SQL command.
char *PQcmdTuples(PGresult *res);
This function returns a string containing the number of rows affected by the SQL statement that generated the PGresult. This function can only be used following the execution of a SELECT, CREATE TABLE AS, INSERT, UPDATE, DELETE, MOVE, FETCH, or COPY statement, or an EXECUTE of a prepared query that contains an INSERT, UPDATE, or DELETE statement. If the command that generated the PGresult was anything else, PQcmdTuples returns an empty string. The caller should not free the return value directly. It will be freed when the associated PGresult handle is passed to PQclear.
PQoidValue
Returns the OID of the inserted row, if the SQL command was an INSERT that inserted exactly one row into a table that has OIDs, or an EXECUTE of a prepared query containing a suitable INSERT statement. Otherwise, this function returns InvalidOid. This function will also return InvalidOid if the table affected by the INSERT statement does not contain OIDs.
Oid PQoidValue(const PGresult *res);
PQoidStatus
This function is deprecated in favor of PQoidValue and is not thread-safe. It returns a string with the OID of the inserted row, while PQoidValue returns the OID value.
char *PQoidStatus(const PGresult *res);
PQescapeLiteral
char *PQescapeLiteral(PGconn *conn, const char *str, size_t length);
PQescapeLiteral escapes a string for use within an SQL command. This is useful when inserting data values as literal constants in SQL commands. Certain characters (such as quotes and backslashes) must be escaped to prevent them from being interpreted specially by the SQL parser. PQescapeLiteral performs this operation.
PQescapeLiteral returns an escaped version of the str parameter in memory allocated with malloc(). This memory should be freed using PQfreemem() when the result is no longer needed. A terminating zero byte is not required, and should not be counted in length. (If a terminating zero byte is found before length bytes are processed, PQescapeLiteral stops at the zero; the behavior is thus rather like strncpy.) The return string has all special characters replaced so that they can be properly processed by the PostgreSQL string literal parser. A terminating zero byte is also added. The single quotes that must surround PostgreSQL string literals are included in the result string.
On error, PQescapeLiteral returns NULL and a suitable message is stored in the conn object.
It is especially important to do proper escaping when handling strings that were received from an untrustworthy source. Otherwise there is a security risk: you are vulnerable to “SQL injection” attacks wherein unwanted SQL commands are fed to your database.
Note that it is neither necessary nor correct to do escaping when a data value is passed as a separate parameter in PQexecParams or its sibling routines.
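A brief sketch of building a command with PQescapeLiteral (the table and column names are illustrative):
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Insert a user-supplied string as a literal, escaping it first. */
static void insert_name(PGconn *conn, const char *raw_name)
{
    char     *quoted = PQescapeLiteral(conn, raw_name, strlen(raw_name));
    char      command[1024];
    PGresult *res;

    if (quoted == NULL)
    {
        fprintf(stderr, "escaping failed: %s", PQerrorMessage(conn));
        return;
    }

    /* quoted already includes the surrounding single quotes */
    snprintf(command, sizeof(command),
             "INSERT INTO people(name) VALUES (%s)", quoted);
    PQfreemem(quoted);

    res = PQexec(conn, command);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
    PQclear(res);
}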
PQescapeIdentifier
char *PQescapeIdentifier(PGconn *conn, const char *str, size_t length);
PQescapeIdentifier escapes a string for use as an SQL identifier, such as a table, column, or function name. This is useful when a user-supplied identifier might contain special characters that would otherwise not be interpreted as part of the identifier by the SQL parser, or when the identifier might contain upper case characters whose case should be preserved.
PQescapeIdentifier returns a version of the str parameter escaped as an SQL identifier in memory allocated with malloc(). This memory must be freed using PQfreemem() when the result is no longer needed. A terminating zero byte is not required, and should not be counted in length. (If a terminating zero byte is found before length bytes are processed, PQescapeIdentifier stops at the zero; the behavior is thus rather like strncpy.) The return string has all special characters replaced so that it will be properly processed as an SQL identifier. A terminating zero byte is also added. The return string will also be surrounded by double quotes.
On error, PQescapeIdentifier returns NULL and a suitable message is stored in the conn object.
As with string literals, to prevent SQL injection attacks, SQL identifiers must be escaped when they are received from an untrustworthy source.
PQescapeStringConn
size_t PQescapeStringConn(PGconn *conn, char *to, const char *from, size_t length, int *error);
PQescapeStringConn escapes string literals, much like PQescapeLiteral. Unlike PQescapeLiteral, the caller is responsible for providing an appropriately sized buffer. Furthermore, PQescapeStringConn does not generate the single quotes that must surround PostgreSQL string literals; they should be provided in the SQL command that the result is inserted into. The parameter from points to the first character of the string that is to be escaped, and the length parameter gives the number of bytes in this string. A terminating zero byte is not required, and should not be counted in length. (If a terminating zero byte is found before length bytes are processed, PQescapeStringConn stops at the zero; the behavior is thus rather like strncpy.) to shall point to a buffer that is able to hold at least one more byte than twice the value of length, otherwise the behavior is undefined. Behavior is likewise undefined if the to and from strings overlap.
If the error parameter is not NULL, then *error is set to zero on success, nonzero on error. Presently the only possible error conditions involve invalid multibyte encoding in the source string. The output string is still generated on error, but it can be expected that the server will reject it as malformed. On error, a suitable message is stored in the conn object, whether or not error is NULL.
PQescapeStringConn returns the number of bytes written to to, not including the terminating zero byte.
PQescapeString
PQescapeString is an older, deprecated version of PQescapeStringConn.
size_t PQescapeString (char *to, const char *from, size_t length);
The only difference from PQescapeStringConn is that PQescapeString does not take PGconn or error parameters. Because of this, it cannot adjust its behavior depending on the connection properties (such as character encoding) and therefore it might give the wrong results. Also, it has no way to report error conditions.
PQescapeString can be used safely in client programs that work with only one PostgreSQL connection at a time (in this case it can find out what it needs to know “behind the scenes”). In other contexts it is a security hazard and should be avoided in favor of PQescapeStringConn.
PQescapeByteaConn
Escapes binary data for use within an SQL command with the type bytea. As with PQescapeStringConn, this is only used when inserting data directly into an SQL command string.
unsigned char *PQescapeByteaConn(PGconn *conn, const unsigned char *from, size_t from_length, size_t *to_length);
Certain byte values must be escaped when used as part of a bytea literal in an SQL statement. PQescapeByteaConn escapes bytes using either hex encoding or backslash escaping. See Section 8.4 for more information.
The from parameter points to the first byte of the string that is to be escaped, and the from_length parameter gives the number of bytes in this binary string. (A terminating zero byte is neither necessary nor counted.) The to_length parameter points to a variable that will hold the resultant escaped string length. This result string length includes the terminating zero byte of the result.
PQescapeByteaConn returns an escaped version of the from parameter binary string in memory allocated with malloc(). This memory should be freed using PQfreemem() when the result is no longer needed. The return string has all special characters replaced so that they can be properly processed by the PostgreSQL string literal parser, and the bytea input function. A terminating zero byte is also added. The single quotes that must surround PostgreSQL string literals are not part of the result string.
On error, a null pointer is returned, and a suitable error message is stored in the conn object. Currently, the only possible error is insufficient memory for the result string.
PQescapeBytea
PQescapeBytea is an older, deprecated version of PQescapeByteaConn.
unsigned char *PQescapeBytea(const unsigned char *from, size_t from_length, size_t *to_length);
The only difference from PQescapeByteaConn is that PQescapeBytea does not take a PGconn parameter. Because of this, PQescapeBytea can only be used safely in client programs that use a single PostgreSQL connection at a time (in this case it can find out what it needs to know “behind the scenes”). It might give the wrong results if used in programs that use multiple database connections (use PQescapeByteaConn in such cases).
PQunescapeBytea
Converts a string representation of binary data into binary data — the reverse of PQescapeBytea. This is needed when retrieving bytea data in text format, but not when retrieving it in binary format.
unsigned char *PQunescapeBytea(const unsigned char *from, size_t *to_length);
The from parameter points to a string such as might be returned by PQgetvalue when applied to a bytea column. PQunescapeBytea converts this string representation into its binary representation. It returns a pointer to a buffer allocated with malloc(), or NULL on error, and puts the size of the buffer in to_length. The result must be freed using PQfreemem when it is no longer needed.
This conversion is not exactly the inverse of PQescapeBytea, because the string is not expected to be “escaped” when received from PQgetvalue. In particular this means there is no need for string quoting considerations, and so no need for a PGconn parameter.
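A sketch of retrieving a bytea column in text format and converting it back to raw bytes:
#include <stdio.h>
#include <libpq-fe.h>

/* Fetch a bytea field returned in text format and unescape it.
 * Assumes res has status PGRES_TUPLES_OK and the field is not null. */
static void read_blob(const PGresult *res, int row, int col)
{
    size_t         blob_len;
    unsigned char *blob = PQunescapeBytea(
        (const unsigned char *) PQgetvalue(res, row, col), &blob_len);

    if (blob == NULL)
    {
        fprintf(stderr, "out of memory unescaping bytea\n");
        return;
    }

    printf("got %zu bytes of binary data\n", blob_len);
    /* ... use the bytes ... */

    PQfreemem(blob);
}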
The PQexec function is adequate for submitting commands in normal, synchronous applications. It has a few deficiencies, however, that can be of importance to some users:
PQexec waits for the command to be completed. The application might have other work to do (such as maintaining a user interface), in which case it won't want to block waiting for the response.
Since the execution of the client application is suspended while it waits for the result, it is hard for the application to decide that it would like to try to cancel the ongoing command. (It can be done from a signal handler, but not otherwise.)
PQexec can return only one PGresult structure. If the submitted command string contains multiple SQL commands, all but the last PGresult are discarded by PQexec.
PQexec always collects the command's entire result, buffering it in a single PGresult. While this simplifies error-handling logic for the application, it can be impractical for results containing many rows.
Applications that do not like these limitations can instead use the underlying functions that PQexec is built from: PQsendQuery and PQgetResult.
There are also PQsendQueryParams, PQsendPrepare, PQsendQueryPrepared, PQsendDescribePrepared, and PQsendDescribePortal, which can be used with PQgetResult to duplicate the functionality of PQexecParams, PQprepare, PQexecPrepared, PQdescribePrepared, and PQdescribePortal respectively.
PQsendQuery
Submits a command to the server without waiting for the result(s). 1 is returned if the command was successfully dispatched and 0 if not (in which case, use PQerrorMessage to get more information about the failure).
int PQsendQuery(PGconn *conn, const char *command);
After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done.
In pipeline mode, command strings containing more than one SQL command are disallowed.
PQsendQueryParams
Submits a command and separate parameters to the server without waiting for the result(s).
int PQsendQueryParams(PGconn *conn, const char *command, int nParams, const Oid *paramTypes, const char * const *paramValues, const int *paramLengths, const int *paramFormats, int resultFormat);
This is equivalent to PQsendQuery except that query parameters can be specified separately from the query string. The function's parameters are handled identically to PQexecParams. Like PQexecParams, it allows only one command in the query string.
PQsendPrepare
Sends a request to create a prepared statement with the given parameters, without waiting for completion.
int PQsendPrepare(PGconn *conn, const char *stmtName, const char *query, int nParams, const Oid *paramTypes);
This is an asynchronous version of PQprepare: it returns 1 if it was able to dispatch the request, and 0 if not. After a successful call, call PQgetResult to determine whether the server successfully created the prepared statement. The function's parameters are handled identically to PQprepare.
PQsendQueryPrepared
Sends a request to execute a prepared statement with given parameters, without waiting for the result(s).
int PQsendQueryPrepared(PGconn *conn, const char *stmtName, int nParams, const char * const *paramValues, const int *paramLengths, const int *paramFormats, int resultFormat);
This is similar to PQsendQueryParams, but the command to be executed is specified by naming a previously-prepared statement, instead of giving a query string. The function's parameters are handled identically to PQexecPrepared.
PQsendDescribePrepared
Submits a request to obtain information about the specified prepared statement, without waiting for completion.
int PQsendDescribePrepared(PGconn *conn, const char *stmtName);
This is an asynchronous version of PQdescribePrepared: it returns 1 if it was able to dispatch the request, and 0 if not. After a successful call, call PQgetResult to obtain the results. The function's parameters are handled identically to PQdescribePrepared.
PQsendDescribePortal
Submits a request to obtain information about the specified portal, without waiting for completion.
int PQsendDescribePortal(PGconn *conn, const char *portalName);
This is an asynchronous version of PQdescribePortal: it returns 1 if it was able to dispatch the request, and 0 if not. After a successful call, call PQgetResult to obtain the results. The function's parameters are handled identically to PQdescribePortal.
PQgetResult
Waits for the next result from a prior PQsendQuery, PQsendQueryParams, PQsendPrepare, PQsendQueryPrepared, PQsendDescribePrepared, PQsendDescribePortal, or PQpipelineSync call, and returns it. A null pointer is returned when the command is complete and there will be no more results.
PGresult *PQgetResult(PGconn *conn);
PQgetResult must be called repeatedly until it returns a null pointer, indicating that the command is done. (If called when no command is active, PQgetResult will just return a null pointer at once.) Each non-null result from PQgetResult should be processed using the same PGresult accessor functions previously described. Don't forget to free each result object with PQclear when done with it. Note that PQgetResult will block only if a command is active and the necessary response data has not yet been read by PQconsumeInput.
In pipeline mode, PQgetResult will return normally unless an error occurs; for any subsequent query sent after the one that caused the error until (and excluding) the next synchronization point, a special result of type PGRES_PIPELINE_ABORTED will be returned, and a null pointer will be returned after it. When the pipeline synchronization point is reached, a result of type PGRES_PIPELINE_SYNC will be returned. The result of the next query after the synchronization point follows immediately (that is, no null pointer is returned after the synchronization point).
Even when PQresultStatus indicates a fatal error, PQgetResult should be called until it returns a null pointer, to allow libpq to process the error information completely.
Using PQsendQuery and PQgetResult solves one of PQexec's problems: If a command string contains multiple SQL commands, the results of those commands can be obtained individually. (This allows a simple form of overlapped processing, by the way: the client can be handling the results of one command while the server is still working on later queries in the same command string.)
Another frequently-desired feature that can be obtained with PQsendQuery and PQgetResult is retrieving large query results a row at a time. This is discussed in Section 34.6.
By itself, calling PQgetResult will still cause the client to block until the server completes the next SQL command. This can be avoided by proper use of two more functions:
PQconsumeInput
If input is available from the server, consume it.
int PQconsumeInput(PGconn *conn);
PQconsumeInput normally returns 1 indicating “no error”, but returns 0 if there was some kind of trouble (in which case PQerrorMessage can be consulted). Note that the result does not say whether any input data was actually collected. After calling PQconsumeInput, the application can check PQisBusy and/or PQnotifies to see if their state has changed.
PQconsumeInput can be called even if the application is not prepared to deal with a result or notification just yet. The function will read available data and save it in a buffer, thereby causing a select() read-ready indication to go away. The application can thus use PQconsumeInput to clear the select() condition immediately, and then examine the results at leisure.
PQisBusy
Returns 1 if a command is busy, that is, PQgetResult would block waiting for input. A 0 return indicates that PQgetResult can be called with assurance of not blocking.
int PQisBusy(PGconn *conn);
PQisBusy will not itself attempt to read data from the server; therefore PQconsumeInput must be invoked first, or the busy state will never end.
A typical application using these functions will have a main loop that uses select() or poll() to wait for all the conditions that it must respond to. One of the conditions will be input available from the server, which in terms of select() means readable data on the file descriptor identified by PQsocket. When the main loop detects input ready, it should call PQconsumeInput to read the input. It can then call PQisBusy, followed by PQgetResult if PQisBusy returns false (0). It can also call PQnotifies to detect NOTIFY messages (see Section 34.9).
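A minimal sketch of such a loop, waiting only on the libpq socket (a real application would add its other file descriptors and also check for notifications):
#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

/* Wait for and process the results of a previously sent query
 * without blocking inside libpq. */
static void drain_results(PGconn *conn)
{
    int sock = PQsocket(conn);

    for (;;)
    {
        fd_set readable;

        FD_ZERO(&readable);
        FD_SET(sock, &readable);

        if (select(sock + 1, &readable, NULL, NULL, NULL) < 0)
        {
            perror("select");
            return;
        }

        if (!PQconsumeInput(conn))
        {
            fprintf(stderr, "input error: %s", PQerrorMessage(conn));
            return;
        }

        while (!PQisBusy(conn))
        {
            PGresult *res = PQgetResult(conn);

            if (res == NULL)        /* command is complete */
                return;
            /* ... examine res with PQresultStatus etc. ... */
            PQclear(res);
        }
    }
}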
A client that uses PQsendQuery/PQgetResult can also attempt to cancel a command that is still being processed by the server; see Section 34.7. But regardless of the return value of PQcancel, the application must continue with the normal result-reading sequence using PQgetResult. A successful cancellation will simply cause the command to terminate sooner than it would have otherwise.
By using the functions described above, it is possible to avoid
blocking while waiting for input from the database server. However,
it is still possible that the application will block waiting to send
output to the server. This is relatively uncommon but can happen if
very long SQL commands or data values are sent. (It is much more
probable if the application sends data via COPY IN
,
however.) To prevent this possibility and achieve completely
nonblocking database operation, the following additional functions
can be used.
PQsetnonblocking
Sets the nonblocking status of the connection.
int PQsetnonblocking(PGconn *conn, int arg);
Sets the state of the connection to nonblocking if
arg
is 1, or blocking if
arg
is 0. Returns 0 if OK, -1 if error.
In the nonblocking state, successful calls to
PQsendQuery
, PQputline
,
PQputnbytes
, PQputCopyData
,
and PQendcopy
will not block; their changes
are stored in the local output buffer until they are flushed.
Unsuccessful calls will return an error and must be retried.
Note that PQexec
does not honor nonblocking
mode; if it is called, it will act in blocking fashion anyway.
PQisnonblocking
Returns the blocking status of the database connection.
int PQisnonblocking(const PGconn *conn);
Returns 1 if the connection is set to nonblocking mode and 0 if blocking.
PQflush
Attempts to flush any queued output data to the server. Returns 0 if successful (or if the send queue is empty), -1 if it failed for some reason, or 1 if it was unable to send all the data in the send queue yet (this case can only occur if the connection is nonblocking).
int PQflush(PGconn *conn);
After sending any command or data on a nonblocking connection, call
PQflush
. If it returns 1, wait for the socket
to become read- or write-ready. If it becomes write-ready, call
PQflush
again. If it becomes read-ready, call
PQconsumeInput
, then call
PQflush
again. Repeat until
PQflush
returns 0. (It is necessary to check for
read-ready and drain the input with PQconsumeInput
,
because the server can block trying to send us data, e.g., NOTICE
messages, and won't read our data until we read its.) Once
PQflush
returns 0, wait for the socket to be
read-ready and then read the response as described above.
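In outline, that flushing procedure might look like the following sketch; flush_output is an invented name, and real code would also handle EINTR and report connection loss.

#include <sys/select.h>
#include <libpq-fe.h>

static int
flush_output(PGconn *conn)
{
    int         sock = PQsocket(conn);

    for (;;)
    {
        int         r = PQflush(conn);
        fd_set      rmask;
        fd_set      wmask;

        if (r == 0)
            return 0;               /* all queued output has been sent */
        if (r < 0)
            return -1;              /* failure; see PQerrorMessage() */

        /* r == 1: wait until the socket is read- or write-ready. */
        FD_ZERO(&rmask);
        FD_ZERO(&wmask);
        FD_SET(sock, &rmask);
        FD_SET(sock, &wmask);
        if (select(sock + 1, &rmask, &wmask, NULL, NULL) < 0)
            return -1;

        /* Drain anything the server sent (e.g., NOTICE messages) so that
         * it does not block waiting for us to read. */
        if (FD_ISSET(sock, &rmask) && !PQconsumeInput(conn))
            return -1;
    }
}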
libpq pipeline mode allows applications to send a query without having to read the result of the previously sent query. Taking advantage of the pipeline mode, a client will wait less for the server, since multiple queries/results can be sent/received in a single network transaction.
While pipeline mode provides a significant performance boost, writing clients using the pipeline mode is more complex because it involves managing a queue of pending queries and finding which result corresponds to which query in the queue.
Pipeline mode also generally consumes more memory on both the client and server, though careful and aggressive management of the send/receive queue can mitigate this. This applies whether the connection is in blocking or non-blocking mode.
While libpq's pipeline API was introduced in PostgreSQL 14, it is a client-side feature which doesn't require special server support and works on any server that supports the v3 extended query protocol. For more information see Section 53.2.4.
To issue pipelines, the application must switch the connection
into pipeline mode,
which is done with PQenterPipelineMode
.
PQpipelineStatus
can be used
to test whether pipeline mode is active.
In pipeline mode, only asynchronous operations
are permitted, command strings containing multiple SQL commands are
disallowed, and so is COPY
.
Using synchronous command execution functions
such as PQfn
,
PQexec
,
PQexecParams
,
PQprepare
,
PQexecPrepared
,
PQdescribePrepared
,
or PQdescribePortal
is an error condition.
Once all dispatched commands have had their results processed, and
the end pipeline result has been consumed, the application may return
to non-pipelined mode with PQexitPipelineMode
.
It is best to use pipeline mode with libpq in non-blocking mode. If used in blocking mode it is possible for a client/server deadlock to occur. [15]
After entering pipeline mode, the application dispatches requests using
PQsendQuery
,
PQsendQueryParams
,
or its prepared-query sibling
PQsendQueryPrepared
.
These requests are queued on the client-side until flushed to the server;
this occurs when PQpipelineSync
is used to
establish a synchronization point in the pipeline,
or when PQflush
is called.
The functions PQsendPrepare
,
PQsendDescribePrepared
, and
PQsendDescribePortal
also work in pipeline mode.
Result processing is described below.
The server executes statements, and returns results, in the order the
client sends them. The server will begin executing the commands in the
pipeline immediately, not waiting for the end of the pipeline.
Note that results are buffered on the server side; the server flushes
that buffer when a synchronization point is established with
PQpipelineSync
, or when
PQsendFlushRequest
is called.
If any statement encounters an error, the server aborts the current
transaction and does not execute any subsequent command in the queue
until the next synchronization point;
a PGRES_PIPELINE_ABORTED
result is produced for
each such command.
(This remains true even if the commands in the pipeline would roll back
the transaction.)
Query processing resumes after the synchronization point.
It's fine for one operation to depend on the results of a prior one; for example, one query may define a table that the next query in the same pipeline uses. Similarly, an application may create a named prepared statement and execute it with later statements in the same pipeline.
To process the result of one query in a pipeline, the application calls
PQgetResult
repeatedly and handles each result
until PQgetResult
returns null.
The result from the next query in the pipeline may then be retrieved using
PQgetResult
again and the cycle repeated.
The application handles individual statement results as normal.
When the results of all the queries in the pipeline have been
returned, PQgetResult
returns a result
containing the status value PGRES_PIPELINE_SYNC.
The client may choose to defer result processing until the complete pipeline has been sent, or interleave that with sending further queries in the pipeline; see Section 34.5.1.4.
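As a concrete illustration of this result-processing sequence, here is a hedged sketch of a small pipeline. The table t, its column v, and the parameter values are invented, error checks on the send functions are omitted, and the code runs in blocking mode for brevity; real applications should prefer the nonblocking event-loop structure recommended in Section 34.5.1.4.

#include <libpq-fe.h>

static int
run_small_pipeline(PGconn *conn)
{
    const char *one[] = {"one"};
    const char *two[] = {"two"};
    PGresult   *res;
    int         i;
    int         ok = 1;

    if (!PQenterPipelineMode(conn))
        return 0;

    /* Queue two statements, then a synchronization point (which also
     * flushes the queued messages to the server). */
    PQsendQueryParams(conn, "INSERT INTO t (v) VALUES ($1)",
                      1, NULL, one, NULL, NULL, 0);
    PQsendQueryParams(conn, "INSERT INTO t (v) VALUES ($1)",
                      1, NULL, two, NULL, NULL, 0);
    PQpipelineSync(conn);

    /* Each query's results are terminated by a NULL from PQgetResult. */
    for (i = 0; i < 2; i++)
    {
        while ((res = PQgetResult(conn)) != NULL)
        {
            if (PQresultStatus(res) != PGRES_COMMAND_OK)
                ok = 0;             /* includes PGRES_PIPELINE_ABORTED */
            PQclear(res);
        }
    }

    /* Then exactly one PGRES_PIPELINE_SYNC, with no NULL after it. */
    res = PQgetResult(conn);
    if (res == NULL || PQresultStatus(res) != PGRES_PIPELINE_SYNC)
        ok = 0;
    if (res != NULL)
        PQclear(res);

    if (!PQexitPipelineMode(conn))
        ok = 0;
    return ok;
}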
To enter single-row mode, call PQsetSingleRowMode
before retrieving results with PQgetResult
.
This mode selection is effective only for the query currently
being processed. For more information on the use of
PQsetSingleRowMode
,
refer to Section 34.6.
PQgetResult
behaves the same as for normal
asynchronous processing except that it may contain the new
PGresult
types PGRES_PIPELINE_SYNC
and PGRES_PIPELINE_ABORTED
.
PGRES_PIPELINE_SYNC
is reported exactly once for each
PQpipelineSync
at the corresponding point
in the pipeline.
PGRES_PIPELINE_ABORTED
is emitted in place of a normal
query result for the first error and all subsequent results
until the next PGRES_PIPELINE_SYNC
;
see Section 34.5.1.3.
PQisBusy
, PQconsumeInput
, etc
operate as normal when processing pipeline results. In particular,
a call to PQisBusy
in the middle of a pipeline
returns 0 if the results for all the queries issued so far have been
consumed.
libpq does not provide any information to the
application about the query currently being processed (except that
PQgetResult
returns null to indicate that the results of the next query are about to be
returned). The application must keep track
of the order in which it sent queries, to associate them with their
corresponding results.
Applications will typically use a state machine or a FIFO queue for this.
From the client's perspective, after PQresultStatus
returns PGRES_FATAL_ERROR
,
the pipeline is flagged as aborted.
PQresultStatus
will report a
PGRES_PIPELINE_ABORTED
result for each remaining queued
operation in an aborted pipeline. The result for
PQpipelineSync
is reported as
PGRES_PIPELINE_SYNC
to signal the end of the aborted pipeline
and resumption of normal result processing.
The client must process results with
PQgetResult
during error recovery.
If the pipeline used an implicit transaction, then operations that have
already executed are rolled back and operations that were queued to follow
the failed operation are skipped entirely. The same behavior holds if the
pipeline starts and commits a single explicit transaction (i.e. the first
statement is BEGIN
and the last is
COMMIT
) except that the session remains in an aborted
transaction state at the end of the pipeline. If a pipeline contains
multiple explicit transactions, all transactions that
committed prior to the error remain committed, the currently in-progress
transaction is aborted, and all subsequent operations are skipped completely,
including subsequent transactions. If a pipeline synchronization point
occurs with an explicit transaction block in aborted state, the next pipeline
will become aborted immediately unless the next command puts the transaction
in normal mode with ROLLBACK
.
The client must not assume that work is committed when it
sends a COMMIT
— only when the
corresponding result is received to confirm the commit is complete.
Because errors arrive asynchronously, the application needs to be able to
restart from the last received committed change and
resend work done after that point if something goes wrong.
To avoid deadlocks on large pipelines the client should be structured
around a non-blocking event loop using operating system facilities
such as select
, poll
,
WaitForMultipleObjectEx
, etc.
The client application should generally maintain a queue of work remaining to be dispatched and a queue of work that has been dispatched but not yet had its results processed. When the socket is writable it should dispatch more work. When the socket is readable it should read results and process them, matching them up to the next entry in its corresponding results queue. Based on available memory, results from the socket should be read frequently: there's no need to wait until the pipeline end to read the results. Pipelines should be scoped to logical units of work, usually (but not necessarily) one transaction per pipeline. There's no need to exit pipeline mode and re-enter it between pipelines, or to wait for one pipeline to finish before sending the next.
An example using select()
and a simple state
machine to track sent and received work is in
src/test/modules/libpq_pipeline/libpq_pipeline.c
in the PostgreSQL source distribution.
PQpipelineStatus
Returns the current pipeline mode status of the libpq connection.
PGpipelineStatus PQpipelineStatus(const PGconn *conn);
PQpipelineStatus
can return one of the following values:
PQ_PIPELINE_ON
The libpq connection is in pipeline mode.
PQ_PIPELINE_OFF
The libpq connection is not in pipeline mode.
PQ_PIPELINE_ABORTED
The libpq connection is in pipeline
mode and an error occurred while processing the current pipeline.
The aborted flag is cleared when PQgetResult
returns a result of type PGRES_PIPELINE_SYNC
.
PQenterPipelineMode
Causes a connection to enter pipeline mode if it is currently idle or already in pipeline mode.
int PQenterPipelineMode(PGconn *conn);
Returns 1 for success. Returns 0 and has no effect if the connection is not currently idle, i.e., it has a result ready, or it is waiting for more input from the server, etc. This function does not actually send anything to the server; it just changes the libpq connection state.
PQexitPipelineMode
Causes a connection to exit pipeline mode if it is currently in pipeline mode with an empty queue and no pending results.
int PQexitPipelineMode(PGconn *conn);
Returns 1 for success. Returns 1 and takes no action if not in
pipeline mode. If the current statement isn't finished processing,
or PQgetResult
has not been called to collect
results from all previously sent queries, returns 0 (in which case,
use PQerrorMessage
to get more information
about the failure).
PQpipelineSync
Marks a synchronization point in a pipeline by sending a sync message and flushing the send buffer. This serves as the delimiter of an implicit transaction and an error recovery point; see Section 34.5.1.3.
int PQpipelineSync(PGconn *conn);
Returns 1 for success. Returns 0 if the connection is not in pipeline mode or sending a sync message failed.
PQsendFlushRequest
Sends a request for the server to flush its output buffer.
int PQsendFlushRequest(PGconn *conn);
Returns 1 for success. Returns 0 on any failure.
The server flushes its output buffer automatically as a result of
PQpipelineSync
being called, or
on any request when not in pipeline mode; this function is useful
to cause the server to flush its output buffer in pipeline mode
without establishing a synchronization point.
Note that the request is not itself flushed to the server automatically;
use PQflush
if necessary.
Much like asynchronous query mode, there is no meaningful performance overhead when using pipeline mode. It increases client application complexity, and extra caution is required to prevent client/server deadlocks, but pipeline mode can offer considerable performance improvements, in exchange for increased memory usage from leaving state around longer.
Pipeline mode is most useful when the server is distant, i.e., network latency (“ping time”) is high, and also when many small operations are being performed in rapid succession. There is usually less benefit in using pipelined commands when each query takes many multiples of the client/server round-trip time to execute. A 100-statement operation run on a server 300 ms round-trip-time away would take 30 seconds in network latency alone without pipelining; with pipelining it may spend as little as 0.3 s waiting for results from the server.
Use pipelined commands when your application does lots of small
INSERT
, UPDATE
and
DELETE
operations that can't easily be transformed
into operations on sets, or into a COPY
operation.
Pipeline mode is not useful when information from one operation is required by the client to produce the next operation. In such cases, the client would have to introduce a synchronization point and wait for a full client/server round-trip to get the results it needs. However, it's often possible to adjust the client design to exchange the required information server-side. Read-modify-write cycles are especially good candidates; for example:
BEGIN;
SELECT x FROM mytable WHERE id = 42 FOR UPDATE;
-- result: x=2
-- client adds 1 to x:
UPDATE mytable SET x = 3 WHERE id = 42;
COMMIT;
could be much more efficiently done with:
UPDATE mytable SET x = x + 1 WHERE id = 42;
Pipelining is less useful, and more complex, when a single pipeline contains multiple transactions (see Section 34.5.1.3).
Ordinarily, libpq collects an SQL command's
entire result and returns it to the application as a single
PGresult
. This can be unworkable for commands
that return a large number of rows. For such cases, applications can use
PQsendQuery
and PQgetResult
in
single-row mode. In this mode, the result row(s) are
returned to the application one at a time, as they are received from the
server.
To enter single-row mode, call PQsetSingleRowMode
immediately after a successful call of PQsendQuery
(or a sibling function). This mode selection is effective only for the
currently executing query. Then call PQgetResult
repeatedly, until it returns null, as documented in Section 34.4. If the query returns any rows, they are returned
as individual PGresult
objects, which look like
normal query results except for having status code
PGRES_SINGLE_TUPLE
instead of
PGRES_TUPLES_OK
. After the last row, or immediately if
the query returns zero rows, a zero-row object with status
PGRES_TUPLES_OK
is returned; this is the signal that no
more rows will arrive. (But note that it is still necessary to continue
calling PQgetResult
until it returns null.) All of
these PGresult
objects will contain the same row
description data (column names, types, etc) that an ordinary
PGresult
object for the query would have.
Each object should be freed with PQclear
as usual.
When using pipeline mode, single-row mode needs to be activated for each
query in the pipeline before retrieving results for that query
with PQgetResult
.
See Section 34.5 for more information.
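A blocking sketch of single-row retrieval might look as follows; the query text and the function name stream_rows are illustrative only, and error handling is abbreviated.

#include <libpq-fe.h>

static void
stream_rows(PGconn *conn)
{
    PGresult   *res;

    if (!PQsendQuery(conn, "SELECT generate_series(1, 1000000)"))
        return;
    if (!PQsetSingleRowMode(conn))
    {
        /* mode unchanged; the result will arrive all at once instead */
    }

    while ((res = PQgetResult(conn)) != NULL)
    {
        switch (PQresultStatus(res))
        {
            case PGRES_SINGLE_TUPLE:
                /* exactly one row: e.g., look at PQgetvalue(res, 0, 0) */
                break;
            case PGRES_TUPLES_OK:
                /* zero-row terminator: no more rows will arrive */
                break;
            default:
                /* error; undo whatever was done with earlier rows */
                break;
        }
        PQclear(res);
    }
}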
PQsetSingleRowMode
Select single-row mode for the currently-executing query.
int PQsetSingleRowMode(PGconn *conn);
This function can only be called immediately after
PQsendQuery
or one of its sibling functions,
before any other operation on the connection such as
PQconsumeInput
or
PQgetResult
. If called at the correct time,
the function activates single-row mode for the current query and
returns 1. Otherwise the mode stays unchanged and the function
returns 0. In any case, the mode reverts to normal after
completion of the current query.
While processing a query, the server may return some rows and then
encounter an error, causing the query to be aborted. Ordinarily,
libpq discards any such rows and reports only the
error. But in single-row mode, those rows will have already been
returned to the application. Hence, the application will see some
PGRES_SINGLE_TUPLE
PGresult
objects followed by a PGRES_FATAL_ERROR
object. For
proper transactional behavior, the application must be designed to
discard or undo whatever has been done with the previously-processed
rows, if the query ultimately fails.
A client application can request cancellation of a command that is still being processed by the server, using the functions described in this section.
PQgetCancel
Creates a data structure containing the information needed to cancel a command issued through a particular database connection.
PGcancel *PQgetCancel(PGconn *conn);
PQgetCancel
creates a
PGcancel
object
given a PGconn
connection object. It will return
NULL
if the given conn
is NULL
or an invalid
connection. The PGcancel
object is an opaque
structure that is not meant to be accessed directly by the
application; it can only be passed to PQcancel
or PQfreeCancel
.
PQfreeCancel
Frees a data structure created by PQgetCancel
.
void PQfreeCancel(PGcancel *cancel);
PQfreeCancel
frees a data object previously created
by PQgetCancel
.
PQcancel
Requests that the server abandon processing of the current command.
int PQcancel(PGcancel *cancel, char *errbuf, int errbufsize);
The return value is 1 if the cancel request was successfully
dispatched and 0 if not. If not, errbuf
is filled
with an explanatory error message. errbuf
must be a char array of size errbufsize
(the
recommended size is 256 bytes).
Successful dispatch is no guarantee that the request will have any effect, however. If the cancellation is effective, the current command will terminate early and return an error result. If the cancellation fails (say, because the server was already done processing the command), then there will be no visible result at all.
PQcancel
can safely be invoked from a signal
handler, if the errbuf
is a local variable in the
signal handler. The PGcancel
object is read-only
as far as PQcancel
is concerned, so it can
also be invoked from a thread that is separate from the one
manipulating the PGconn
object.
PQrequestCancel
PQrequestCancel
is a deprecated variant of
PQcancel
.
int PQrequestCancel(PGconn *conn);
Requests that the server abandon processing of the current
command. It operates directly on the
PGconn
object, and in case of failure stores the
error message in the PGconn
object (whence it can
be retrieved by PQerrorMessage
). Although
the functionality is the same, this approach creates hazards for
multiple-thread programs and signal handlers, since it is possible
that overwriting the PGconn
's error message will
mess up the operation currently in progress on the connection.
PostgreSQL provides a fast-path interface to send simple function calls to the server.
This interface is somewhat obsolete, as one can achieve similar performance and greater functionality by setting up a prepared statement to define the function call. Then, executing the statement with binary transmission of parameters and results substitutes for a fast-path function call.
The function PQfn
requests execution of a server function via the fast-path interface:
PGresult *PQfn(PGconn *conn,
               int fnid,
               int *result_buf,
               int *result_len,
               int result_is_int,
               const PQArgBlock *args,
               int nargs);

typedef struct
{
    int len;
    int isint;
    union
    {
        int *ptr;
        int integer;
    } u;
} PQArgBlock;
The fnid
argument is the OID of the function to be
executed. args
and nargs
define the
parameters to be passed to the function; they must match the declared
function argument list. When the isint
field of a
parameter structure is true, the u.integer
value is sent
to the server as an integer of the indicated length (this must be
2 or 4 bytes); proper byte-swapping occurs. When isint
is false, the indicated number of bytes at *u.ptr
are
sent with no processing; the data must be in the format expected by
the server for binary transmission of the function's argument data
type. (The declaration of u.ptr
as being of
type int *
is historical; it would be better to consider
it void *
.)
result_buf
points to the buffer in which to place
the function's return value. The caller must have allocated sufficient
space to store the return value. (There is no check!) The actual result
length in bytes will be returned in the integer pointed to by
result_len
. If a 2- or 4-byte integer result
is expected, set result_is_int
to 1, otherwise
set it to 0. Setting result_is_int
to 1 causes
libpq to byte-swap the value if necessary, so that it
is delivered as a proper int
value for the client machine;
note that a 4-byte integer is delivered into *result_buf
for either allowed result size.
When result_is_int
is 0, the binary-format byte string
sent by the server is returned unmodified. (In this case it's better
to consider result_buf
as being of
type void *
.)
PQfn
always returns a valid
PGresult
pointer, with
status PGRES_COMMAND_OK
for success
or PGRES_FATAL_ERROR
if some problem was encountered.
The result status should be
checked before the result is used. The caller is responsible for
freeing the PGresult
with
PQclear
when it is no longer needed.
To pass a NULL argument to the function, set
the len
field of that parameter structure
to -1
; the isint
and u
fields are then irrelevant.
If the function returns NULL, *result_len
is set
to -1
, and *result_buf
is not
modified.
Note that it is not possible to handle set-valued results when using this interface. Also, the function must be a plain function, not an aggregate, window function, or procedure.
PostgreSQL offers asynchronous notification
via the LISTEN
and NOTIFY
commands. A client session registers its interest in a particular
notification channel with the LISTEN
command (and
can stop listening with the UNLISTEN
command). All
sessions listening on a particular channel will be notified
asynchronously when a NOTIFY
command with that
channel name is executed by any session. A “payload” string can
be passed to communicate additional data to the listeners.
libpq applications submit
LISTEN
, UNLISTEN
,
and NOTIFY
commands as
ordinary SQL commands. The arrival of NOTIFY
messages can subsequently be detected by calling
PQnotifies
.
The function PQnotifies
returns the next notification
from a list of unhandled notification messages received from the server.
It returns a null pointer if there are no pending notifications. Once a
notification is returned from PQnotifies
, it is considered
handled and will be removed from the list of notifications.
PGnotify *PQnotifies(PGconn *conn);

typedef struct pgNotify
{
    char *relname;              /* notification channel name */
    int   be_pid;               /* process ID of notifying server process */
    char *extra;                /* notification payload string */
} PGnotify;
After processing a PGnotify
object returned
by PQnotifies
, be sure to free it with
PQfreemem
. It is sufficient to free the
PGnotify
pointer; the
relname
and extra
fields do not represent separate allocations. (The names of these fields
are historical; in particular, channel names need not have anything to
do with relation names.)
Example 34.2 gives a sample program that illustrates the use of asynchronous notification.
PQnotifies
does not actually read data from the
server; it just returns messages previously absorbed by another
libpq function. In ancient releases of
libpq, the only way to ensure timely receipt
of NOTIFY
messages was to constantly submit commands, even
empty ones, and then check PQnotifies
after each
PQexec
. While this still works, it is deprecated
as a waste of processing power.
A better way to check for NOTIFY
messages when you have no
useful commands to execute is to call
PQconsumeInput
, then check
PQnotifies
. You can use
select()
to wait for data to arrive from the
server, thereby using no CPU power unless there is
something to do. (See PQsocket
to obtain the file
descriptor number to use with select()
.) Note that
this will work OK whether you submit commands with
PQsendQuery
/PQgetResult
or
simply use PQexec
. You should, however, remember
to check PQnotifies
after each
PQgetResult
or PQexec
, to
see if any notifications came in during the processing of the command.
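Put together, a minimal sketch of that approach could look like this; it assumes a LISTEN command has already been issued on conn, the function name is invented, and error handling is abbreviated.

#include <sys/select.h>
#include <stdio.h>
#include <libpq-fe.h>

static void
wait_for_notifications(PGconn *conn)
{
    int         sock = PQsocket(conn);

    for (;;)
    {
        fd_set      input_mask;
        PGnotify   *note;

        FD_ZERO(&input_mask);
        FD_SET(sock, &input_mask);
        if (select(sock + 1, &input_mask, NULL, NULL, NULL) < 0)
            break;                  /* real code would retry on EINTR */

        if (!PQconsumeInput(conn))
            break;                  /* connection problem */

        while ((note = PQnotifies(conn)) != NULL)
        {
            printf("NOTIFY on \"%s\" from PID %d: %s\n",
                   note->relname, note->be_pid, note->extra);
            PQfreemem(note);
        }
    }
}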
Functions Associated with the COPY Command
The COPY
command in
PostgreSQL has options to read from or write
to the network connection used by libpq.
The functions described in this section allow applications to take
advantage of this capability by supplying or consuming copied data.
The overall process is that the application first issues the SQL
COPY
command via PQexec
or one
of the equivalent functions. The response to this (if there is no
error in the command) will be a PGresult
object bearing
a status code of PGRES_COPY_OUT
or
PGRES_COPY_IN
(depending on the specified copy
direction). The application should then use the functions of this
section to receive or transmit data rows. When the data transfer is
complete, another PGresult
object is returned to indicate
success or failure of the transfer. Its status will be
PGRES_COMMAND_OK
for success or
PGRES_FATAL_ERROR
if some problem was encountered.
At this point further SQL commands can be issued via
PQexec
. (It is not possible to execute other SQL
commands using the same connection while the COPY
operation is in progress.)
If a COPY
command is issued via
PQexec
in a string that could contain additional
commands, the application must continue fetching results via
PQgetResult
after completing the COPY
sequence. Only when PQgetResult
returns
NULL
is it certain that the PQexec
command string is done and it is safe to issue more commands.
The functions of this section should be executed only after obtaining
a result status of PGRES_COPY_OUT
or
PGRES_COPY_IN
from PQexec
or
PQgetResult
.
A PGresult
object bearing one of these status values
carries some additional data about the COPY
operation
that is starting. This additional data is available using functions
that are also used in connection with query results:
PQnfields
Returns the number of columns (fields) to be copied.
PQbinaryTuples
0 indicates the overall copy format is textual (rows separated by newlines, columns separated by separator characters, etc). 1 indicates the overall copy format is binary. See COPY for more information.
PQfformat
Returns the format code (0 for text, 1 for binary) associated with
each column of the copy operation. The per-column format codes
will always be zero when the overall copy format is textual, but
the binary format can support both text and binary columns.
(However, as of the current implementation of COPY
,
only binary columns appear in a binary copy; so the per-column
formats always match the overall format at present.)
Functions for Sending COPY Data
These functions are used to send data during COPY FROM
STDIN
. They will fail if called when the connection is not in
COPY_IN
state.
PQputCopyData
Sends data to the server during COPY_IN
state.
int PQputCopyData(PGconn *conn, const char *buffer, int nbytes);
Transmits the COPY
data in the specified
buffer
, of length nbytes
, to the server.
The result is 1 if the data was queued, zero if it was not queued
because of full buffers (this will only happen in nonblocking mode),
or -1 if an error occurred.
(Use PQerrorMessage
to retrieve details if
the return value is -1. If the value is zero, wait for write-ready
and try again.)
The application can divide the COPY
data stream
into buffer loads of any convenient size. Buffer-load boundaries
have no semantic significance when sending. The contents of the
data stream must match the data format expected by the
COPY
command; see COPY for details.
PQputCopyEnd
Sends end-of-data indication to the server during COPY_IN
state.
int PQputCopyEnd(PGconn *conn, const char *errormsg);
Ends the COPY_IN
operation successfully if
errormsg
is NULL
. If
errormsg
is not NULL
then the
COPY
is forced to fail, with the string pointed to by
errormsg
used as the error message. (One should not
assume that this exact error message will come back from the server,
however, as the server might have already failed the
COPY
for its own reasons.)
The result is 1 if the termination message was sent; or in
nonblocking mode, this may only indicate that the termination
message was successfully queued. (In nonblocking mode, to be
certain that the data has been sent, you should next wait for
write-ready and call PQflush
, repeating until it
returns zero.) Zero indicates that the function could not queue
the termination message because of full buffers; this will only
happen in nonblocking mode. (In this case, wait for
write-ready and try the PQputCopyEnd
call
again.) If a hard error occurs, -1 is returned; you can use
PQerrorMessage
to retrieve details.
After successfully calling PQputCopyEnd
, call
PQgetResult
to obtain the final result status of the
COPY
command. One can wait for this result to be
available in the usual way. Then return to normal operation.
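A hedged end-to-end sketch of sending COPY data follows; the table, columns, and data values are invented, and return-value checks on the data-sending calls are omitted for brevity.

#include <libpq-fe.h>

static int
copy_in_two_rows(PGconn *conn)
{
    PGresult   *res;
    int         ok;

    res = PQexec(conn, "COPY mytable (id, name) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        PQclear(res);
        return 0;
    }
    PQclear(res);

    /* Buffer boundaries need not coincide with row boundaries. */
    PQputCopyData(conn, "1\tone\n", 6);
    PQputCopyData(conn, "2\ttwo\n", 6);

    /* NULL means "ended successfully"; a message would force failure. */
    if (PQputCopyEnd(conn, NULL) <= 0)
        return 0;

    /* Fetch the command's final status, then drain remaining results. */
    res = PQgetResult(conn);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);
    return ok;
}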
Functions for Receiving COPY Data
These functions are used to receive data during COPY TO
STDOUT
. They will fail if called when the connection is not in
COPY_OUT
state.
PQgetCopyData
Receives data from the server during COPY_OUT
state.
int PQgetCopyData(PGconn *conn, char **buffer, int async);
Attempts to obtain another row of data from the server during a
COPY
. Data is always returned one data row at
a time; if only a partial row is available, it is not returned.
Successful return of a data row involves allocating a chunk of
memory to hold the data. The buffer
parameter must
be non-NULL
. *buffer
is set to
point to the allocated memory, or to NULL
in cases
where no buffer is returned. A non-NULL
result
buffer should be freed using PQfreemem
when no longer
needed.
When a row is successfully returned, the return value is the number
of data bytes in the row (this will always be greater than zero).
The returned string is always null-terminated, though this is
probably only useful for textual COPY
. A result
of zero indicates that the COPY
is still in
progress, but no row is yet available (this is only possible when
async
is true). A result of -1 indicates that the
COPY
is done. A result of -2 indicates that an
error occurred (consult PQerrorMessage
for the reason).
When async
is true (not zero),
PQgetCopyData
will not block waiting for input; it
will return zero if the COPY
is still in progress
but no complete row is available. (In this case wait for read-ready
and then call PQconsumeInput
before calling
PQgetCopyData
again.) When async
is
false (zero), PQgetCopyData
will block until data is
available or the operation completes.
After PQgetCopyData
returns -1, call
PQgetResult
to obtain the final result status of the
COPY
command. One can wait for this result to be
available in the usual way. Then return to normal operation.
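The corresponding receive side might look like the following sketch; again the table name is invented and error handling is abbreviated.

#include <stdio.h>
#include <libpq-fe.h>

static void
copy_out_to_stdout(PGconn *conn)
{
    PGresult   *res;
    char       *row;
    int         len;

    res = PQexec(conn, "COPY mytable TO STDOUT");
    if (PQresultStatus(res) != PGRES_COPY_OUT)
    {
        PQclear(res);
        return;
    }
    PQclear(res);

    /* async = 0: block until a whole row (or end of data) is available. */
    while ((len = PQgetCopyData(conn, &row, 0)) > 0)
    {
        fwrite(row, 1, len, stdout);
        PQfreemem(row);
    }

    if (len == -2)
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));

    /* len == -1: end of data; collect the final command status. */
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);
}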
Obsolete Functions for COPY
These functions represent older methods of handling COPY
.
Although they still work, they are deprecated due to poor error handling,
inconvenient methods of detecting end-of-data, and lack of support for binary
or nonblocking transfers.
PQgetline
Reads a newline-terminated line of characters (transmitted
by the server) into a buffer string of size length
.
int PQgetline(PGconn *conn, char *buffer, int length);
This function copies up to length
-1 characters into
the buffer and converts the terminating newline into a zero byte.
PQgetline
returns EOF
at the
end of input, 0 if the entire line has been read, and 1 if the
buffer is full but the terminating newline has not yet been read.
Note that the application must check to see if a new line consists
of the two characters \.
, which indicates
that the server has finished sending the results of the
COPY
command. If the application might receive
lines that are more than length
-1 characters long,
care is needed to be sure it recognizes the \.
line correctly (and does not, for example, mistake the end of a
long data line for a terminator line).
PQgetlineAsync
Reads a row of COPY
data (transmitted by the
server) into a buffer without blocking.
int PQgetlineAsync(PGconn *conn, char *buffer, int bufsize);
This function is similar to PQgetline
, but it can be used
by applications
that must read COPY
data asynchronously, that is, without blocking.
Having issued the COPY
command and gotten a PGRES_COPY_OUT
response, the
application should call PQconsumeInput
and
PQgetlineAsync
until the
end-of-data signal is detected.
Unlike PQgetline
, this function takes
responsibility for detecting end-of-data.
On each call, PQgetlineAsync
will return data if a
complete data row is available in libpq's input buffer.
Otherwise, no data is returned until the rest of the row arrives.
The function returns -1 if the end-of-copy-data marker has been recognized,
or 0 if no data is available, or a positive number giving the number of
bytes of data returned. If -1 is returned, the caller must next call
PQendcopy
, and then return to normal processing.
The data returned will not extend beyond a data-row boundary. If possible
a whole row will be returned at one time. But if the buffer offered by
the caller is too small to hold a row sent by the server, then a partial
data row will be returned. With textual data this can be detected by testing
whether the last returned byte is \n
or not. (In a binary
COPY
, actual parsing of the COPY
data format will be needed to make the
equivalent determination.)
The returned string is not null-terminated. (If you want to add a
terminating null, be sure to pass a bufsize
one smaller
than the room actually available.)
PQputline
Sends a null-terminated string to the server. Returns 0 if
OK and EOF
if unable to send the string.
int PQputline(PGconn *conn, const char *string);
The COPY
data stream sent by a series of calls
to PQputline
has the same format as that
returned by PQgetlineAsync
, except that
applications are not obliged to send exactly one data row per
PQputline
call; it is okay to send a partial
line or multiple lines per call.
Before PostgreSQL protocol 3.0, it was necessary
for the application to explicitly send the two characters
\.
as a final line to indicate to the server that it had
finished sending COPY
data. While this still works, it is deprecated and the
special meaning of \.
can be expected to be removed in a
future release. It is sufficient to call PQendcopy
after
having sent the actual data.
PQputnbytes
Sends a non-null-terminated string to the server. Returns
0 if OK and EOF
if unable to send the string.
int PQputnbytes(PGconn *conn, const char *buffer, int nbytes);
This is exactly like PQputline
, except that the data
buffer need not be null-terminated since the number of bytes to send is
specified directly. Use this procedure when sending binary data.
PQendcopy
Synchronizes with the server.
int PQendcopy(PGconn *conn);
This function waits until the server has finished the copying.
It should either be issued when the last string has been sent
to the server using PQputline
or when the
last string has been received from the server using
PQgetline
. It must be issued or the server
will get “out of sync” with the client. Upon return
from this function, the server is ready to receive the next SQL
command. The return value is 0 on successful completion,
nonzero otherwise. (Use PQerrorMessage
to
retrieve details if the return value is nonzero.)
When using PQgetResult
, the application should
respond to a PGRES_COPY_OUT
result by executing
PQgetline
repeatedly, followed by
PQendcopy
after the terminator line is seen.
It should then return to the PQgetResult
loop
until PQgetResult
returns a null pointer.
Similarly a PGRES_COPY_IN
result is processed
by a series of PQputline
calls followed by
PQendcopy
, then return to the
PQgetResult
loop. This arrangement will
ensure that a COPY
command embedded in a series
of SQL commands will be executed correctly.
Older applications are likely to submit a COPY
via PQexec
and assume that the transaction
is done after PQendcopy
. This will work
correctly only if the COPY
is the only
SQL command in the command string.
These functions control miscellaneous details of libpq's behavior.
PQclientEncoding
Returns the client encoding.
int PQclientEncoding(const PGconn *conn);
Note that it returns the encoding ID, not a symbolic string
such as EUC_JP
. If unsuccessful, it returns -1.
To convert an encoding ID to an encoding name, you
can use:
char *pg_encoding_to_char(int encoding_id);
PQsetClientEncoding
Sets the client encoding.
int PQsetClientEncoding(PGconn *conn, const char *encoding);
conn
is a connection to the server,
and encoding
is the encoding you want to
use. If the function successfully sets the encoding, it returns 0,
otherwise -1. The current encoding for this connection can be
determined by using PQclientEncoding
.
PQsetErrorVerbosity
Determines the verbosity of messages returned by
PQerrorMessage
and PQresultErrorMessage
.
typedef enum
{
    PQERRORS_TERSE,
    PQERRORS_DEFAULT,
    PQERRORS_VERBOSE,
    PQERRORS_SQLSTATE
} PGVerbosity;

PGVerbosity PQsetErrorVerbosity(PGconn *conn, PGVerbosity verbosity);
PQsetErrorVerbosity
sets the verbosity mode,
returning the connection's previous setting.
In TERSE mode, returned messages include
severity, primary text, and position only; this will normally fit on a
single line. The DEFAULT mode produces messages
that include the above plus any detail, hint, or context fields (these
might span multiple lines). The VERBOSE mode
includes all available fields. The SQLSTATE
mode includes only the error severity and the SQLSTATE
error code, if one is available (if not, the output is like
TERSE mode).
Changing the verbosity setting does not affect the messages available
from already-existing PGresult
objects, only
subsequently-created ones.
(But see PQresultVerboseErrorMessage
if you
want to print a previous error with a different verbosity.)
PQsetErrorContextVisibility
Determines the handling of CONTEXT
fields in messages
returned by PQerrorMessage
and PQresultErrorMessage
.
typedef enum
{
    PQSHOW_CONTEXT_NEVER,
    PQSHOW_CONTEXT_ERRORS,
    PQSHOW_CONTEXT_ALWAYS
} PGContextVisibility;

PGContextVisibility PQsetErrorContextVisibility(PGconn *conn, PGContextVisibility show_context);
PQsetErrorContextVisibility
sets the context display mode,
returning the connection's previous setting. This mode controls
whether the CONTEXT
field is included in messages.
The NEVER mode
never includes CONTEXT
, while ALWAYS always
includes it if available. In ERRORS mode (the
default), CONTEXT
fields are included only in error
messages, not in notices and warnings.
(However, if the verbosity setting is TERSE
or SQLSTATE, CONTEXT
fields
are omitted regardless of the context display mode.)
Changing this mode does not
affect the messages available from
already-existing PGresult
objects, only
subsequently-created ones.
(But see PQresultVerboseErrorMessage
if you
want to print a previous error with a different display mode.)
PQtrace
Enables tracing of the client/server communication to a debugging file stream.
void PQtrace(PGconn *conn, FILE *stream);
Each line consists of: an optional timestamp, a direction indicator
(F
for messages from client to server
or B
for messages from server to client),
message length, message type, and message contents.
Non-message contents fields (timestamp, direction, length and message type)
are separated by a tab. Message contents are separated by a space.
Protocol strings are enclosed in double quotes, while strings used as data
values are enclosed in single quotes. Non-printable chars are printed as
hexadecimal escapes.
Further message-type-specific detail can be found in
Section 53.7.
On Windows, if the libpq library and an application are
compiled with different flags, this function call will crash the
application because the internal representation of the FILE
pointers differ. Specifically, multithreaded/single-threaded,
release/debug, and static/dynamic flags should be the same for the
library and all applications using that library.
PQsetTraceFlags
Controls the tracing behavior of client/server communication.
void PQsetTraceFlags(PGconn *conn, int flags);
flags
contains flag bits describing the operating mode
of tracing.
If flags
contains PQTRACE_SUPPRESS_TIMESTAMPS
,
then the timestamp is not included when printing each message.
If flags
contains PQTRACE_REGRESS_MODE
,
then some fields are redacted when printing each message, such as object
OIDs, to make the output more convenient to use in testing frameworks.
This function must be called after calling PQtrace
.
PQuntrace
Disables tracing started by PQtrace
.
void PQuntrace(PGconn *conn);
As always, there are some functions that just don't fit anywhere.
PQfreemem
Frees memory allocated by libpq.
void PQfreemem(void *ptr);
Frees memory allocated by libpq, particularly
PQescapeByteaConn
,
PQescapeBytea
,
PQunescapeBytea
,
and PQnotifies
.
It is particularly important that this function, rather than
free()
, be used on Microsoft Windows. This is because
allocating memory in a DLL and releasing it in the application works
only if multithreaded/single-threaded, release/debug, and static/dynamic
flags are the same for the DLL and the application. On non-Microsoft
Windows platforms, this function is the same as the standard library
function free()
.
PQconninfoFree
Frees the data structures allocated by
PQconndefaults
or PQconninfoParse
.
void PQconninfoFree(PQconninfoOption *connOptions);
A simple PQfreemem
will not do for this, since
the array contains references to subsidiary strings.
PQencryptPasswordConn
Prepares the encrypted form of a PostgreSQL password.
char *PQencryptPasswordConn(PGconn *conn, const char *passwd, const char *user, const char *algorithm);
This function is intended to be used by client applications that
wish to send commands like ALTER USER joe PASSWORD
'pwd'
. It is good practice not to send the original cleartext
password in such a command, because it might be exposed in command
logs, activity displays, and so on. Instead, use this function to
convert the password to encrypted form before it is sent.
The passwd
and user
arguments
are the cleartext password, and the SQL name of the user it is for.
algorithm
specifies the encryption algorithm
to use to encrypt the password. Currently supported algorithms are
md5
and scram-sha-256
(on
and
off
are also accepted as aliases for md5
, for
compatibility with older server versions). Note that support for
scram-sha-256
was introduced in PostgreSQL
version 10, and will not work correctly with older server versions. If
algorithm
is NULL
, this function will query
the server for the current value of the
password_encryption setting. That can block, and
will fail if the current transaction is aborted, or if the connection
is busy executing another query. If you wish to use the default
algorithm for the server but want to avoid blocking, query
password_encryption
yourself before calling
PQencryptPasswordConn
, and pass that value as the
algorithm
.
The return value is a string allocated by malloc
.
The caller can assume the string doesn't contain any special characters
that would require escaping. Use PQfreemem
to free the
result when done with it. On error, returns NULL
, and
a suitable message is stored in the connection object.
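As an illustration, a sketch of changing a role's password without sending it in cleartext might look like the following; set_password is an invented helper, and it relies on PQescapeIdentifier (documented earlier in this chapter) for quoting the role name, while the encrypted string itself needs no escaping as noted above.

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static int
set_password(PGconn *conn, const char *user, const char *new_password)
{
    char       *encrypted;
    char       *quoted_user;
    char        command[1024];
    PGresult   *res;
    int         ok = 0;

    encrypted = PQencryptPasswordConn(conn, new_password, user, NULL);
    if (encrypted == NULL)
        return 0;                   /* details in PQerrorMessage(conn) */

    quoted_user = PQescapeIdentifier(conn, user, strlen(user));
    if (quoted_user != NULL)
    {
        snprintf(command, sizeof(command),
                 "ALTER USER %s PASSWORD '%s'", quoted_user, encrypted);
        res = PQexec(conn, command);
        ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
        PQfreemem(quoted_user);
    }
    PQfreemem(encrypted);
    return ok;
}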
PQencryptPassword
Prepares the md5-encrypted form of a PostgreSQL password.
char *PQencryptPassword(const char *passwd, const char *user);
PQencryptPassword
is an older, deprecated version of
PQencryptPasswordConn
. The difference is that
PQencryptPassword
does not
require a connection object, and md5
is always used as the
encryption algorithm.
PQmakeEmptyPGresult
Constructs an empty PGresult
object with the given status.
PGresult *PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status);
This is libpq's internal function to allocate and
initialize an empty PGresult
object. This
function returns NULL
if memory could not be allocated. It is
exported because some applications find it useful to generate result
objects (particularly objects with error status) themselves. If
conn
is not null and status
indicates an error, the current error message of the specified
connection is copied into the PGresult
.
Also, if conn
is not null, any event procedures
registered in the connection are copied into the
PGresult
. (They do not get
PGEVT_RESULTCREATE
calls, but see
PQfireResultCreateEvents
.)
Note that PQclear
should eventually be called
on the object, just as with a PGresult
returned by libpq itself.
PQfireResultCreateEvents
Fires a PGEVT_RESULTCREATE
event (see Section 34.14) for each event procedure registered in the
PGresult
object. Returns non-zero for success,
zero if any event procedure fails.
int PQfireResultCreateEvents(PGconn *conn, PGresult *res);
The conn
argument is passed through to event procedures
but not used directly. It can be NULL
if the event
procedures won't use it.
Event procedures that have already received a
PGEVT_RESULTCREATE
or PGEVT_RESULTCOPY
event
for this object are not fired again.
The main reason that this function is separate from
PQmakeEmptyPGresult
is that it is often appropriate
to create a PGresult
and fill it with data
before invoking the event procedures.
PQcopyResult
Makes a copy of a PGresult
object. The copy is
not linked to the source result in any way and
PQclear
must be called when the copy is no longer
needed. If the function fails, NULL
is returned.
PGresult *PQcopyResult(const PGresult *src, int flags);
This is not intended to make an exact copy. The returned result is
always put into PGRES_TUPLES_OK
status, and does not
copy any error message in the source. (It does copy the command status
string, however.) The flags
argument determines
what else is copied. It is a bitwise OR of several flags.
PG_COPYRES_ATTRS
specifies copying the source
result's attributes (column definitions).
PG_COPYRES_TUPLES
specifies copying the source
result's tuples. (This implies copying the attributes, too.)
PG_COPYRES_NOTICEHOOKS
specifies
copying the source result's notice hooks.
PG_COPYRES_EVENTS
specifies copying the source
result's events. (But any instance data associated with the source
is not copied.)
PQsetResultAttrs
Sets the attributes of a PGresult
object.
int PQsetResultAttrs(PGresult *res, int numAttributes, PGresAttDesc *attDescs);
The provided attDescs
are copied into the result.
If the attDescs
pointer is NULL
or
numAttributes
is less than one, the request is
ignored and the function succeeds. If res
already contains attributes, the function will fail. If the function
fails, the return value is zero. If the function succeeds, the return
value is non-zero.
PQsetvalue
Sets a tuple field value of a PGresult
object.
int PQsetvalue(PGresult *res, int tup_num, int field_num, char *value, int len);
The function will automatically grow the result's internal tuples array
as needed. However, the tup_num
argument must be
less than or equal to PQntuples
, meaning this
function can only grow the tuples array one tuple at a time. But any
field of any existing tuple can be modified in any order. If a value at
field_num
already exists, it will be overwritten.
If len
is -1 or
value
is NULL
, the field value
will be set to an SQL null value. The
value
is copied into the result's private storage,
thus is no longer needed after the function
returns. If the function fails, the return value is zero. If the
function succeeds, the return value is non-zero.
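These result-construction functions can be combined as in the following hedged sketch, which synthesizes a one-column, one-row result; the column name, the type OID (25, assumed to be text), and the value are illustrative assumptions, not a prescribed usage.

#include <string.h>
#include <libpq-fe.h>

static PGresult *
make_fake_result(PGconn *conn)
{
    PGresult       *res = PQmakeEmptyPGresult(conn, PGRES_TUPLES_OK);
    PGresAttDesc    attr;
    char            colname[] = "message";
    char            value[] = "hello";

    if (res == NULL)
        return NULL;

    memset(&attr, 0, sizeof(attr));
    attr.name = colname;            /* column name */
    attr.typid = 25;                /* assume OID of text */
    attr.typlen = -1;               /* variable-length type */
    attr.atttypmod = -1;
    attr.format = 0;                /* text format */

    if (!PQsetResultAttrs(res, 1, &attr) ||
        !PQsetvalue(res, 0, 0, value, (int) strlen(value)))
    {
        PQclear(res);
        return NULL;
    }
    return res;                     /* caller frees it with PQclear() */
}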
PQresultAlloc
Allocate subsidiary storage for a PGresult
object.
void *PQresultAlloc(PGresult *res, size_t nBytes);
Any memory allocated with this function will be freed when
res
is cleared. If the function fails,
the return value is NULL
. The result is
guaranteed to be adequately aligned for any type of data,
just as for malloc
.
PQresultMemorySize
Retrieves the number of bytes allocated for
a PGresult
object.
size_t PQresultMemorySize(const PGresult *res);
This value is the sum of all malloc
requests
associated with the PGresult
object, that is,
all the space that will be freed by PQclear
.
This information can be useful for managing memory consumption.
PQlibVersion
Return the version of libpq that is being used.
int PQlibVersion(void);
The result of this function can be used to determine, at
run time, whether specific functionality is available in the currently
loaded version of libpq. The function can be used, for example,
to determine which connection options are available in
PQconnectdb
.
The result is formed by multiplying the library's major version number by 10000 and adding the minor version number. For example, version 10.1 will be returned as 100001, and version 11.0 will be returned as 110000.
Prior to major version 10, PostgreSQL used
three-part version numbers in which the first two parts together
represented the major version. For those
versions, PQlibVersion
uses two digits for each
part; for example version 9.1.5 will be returned as 90105, and
version 9.2.0 will be returned as 90200.
Therefore, for purposes of determining feature compatibility,
applications should divide the result of PQlibVersion
by 100 not 10000 to determine a logical major version number.
In all release series, only the last two digits differ between
minor releases (bug-fix releases).
This function appeared in PostgreSQL version 9.1, so it cannot be used to detect required functionality in earlier versions, since calling it will create a link dependency on version 9.1 or later.
Notice and warning messages generated by the server are not returned
by the query execution functions, since they do not imply failure of
the query. Instead they are passed to a notice handling function, and
execution continues normally after the handler returns. The default
notice handling function prints the message on
stderr
, but the application can override this
behavior by supplying its own handling function.
For historical reasons, there are two levels of notice handling, called the notice receiver and notice processor. The default behavior is for the notice receiver to format the notice and pass a string to the notice processor for printing. However, an application that chooses to provide its own notice receiver will typically ignore the notice processor layer and just do all the work in the notice receiver.
The function PQsetNoticeReceiver
sets or
examines the current notice receiver for a connection object.
Similarly, PQsetNoticeProcessor
sets or
examines the current notice processor.
typedef void (*PQnoticeReceiver) (void *arg, const PGresult *res);

PQnoticeReceiver PQsetNoticeReceiver(PGconn *conn,
                                     PQnoticeReceiver proc,
                                     void *arg);

typedef void (*PQnoticeProcessor) (void *arg, const char *message);

PQnoticeProcessor PQsetNoticeProcessor(PGconn *conn,
                                       PQnoticeProcessor proc,
                                       void *arg);
Each of these functions returns the previous notice receiver or processor function pointer, and sets the new value. If you supply a null function pointer, no action is taken, but the current pointer is returned.
When a notice or warning message is received from the server, or
generated internally by libpq, the notice
receiver function is called. It is passed the message in the form of
a PGRES_NONFATAL_ERROR
PGresult
. (This allows the receiver to extract
individual fields using PQresultErrorField
, or obtain a
complete preformatted message using PQresultErrorMessage
or PQresultVerboseErrorMessage
.) The same
void pointer passed to PQsetNoticeReceiver
is also
passed. (This pointer can be used to access application-specific state
if needed.)
The default notice receiver simply extracts the message (using
PQresultErrorMessage
) and passes it to the notice
processor.
The notice processor is responsible for handling a notice or warning
message given in text form. It is passed the string text of the message
(including a trailing newline), plus a void pointer that is the same
one passed to PQsetNoticeProcessor
. (This pointer
can be used to access application-specific state if needed.)
The default notice processor is simply:
static void
defaultNoticeProcessor(void *arg, const char *message)
{
    fprintf(stderr, "%s", message);
}
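An application-supplied replacement might look like the following sketch; the log file passed through the void pointer is an assumption made purely for illustration.

#include <stdio.h>
#include <libpq-fe.h>

static void
my_notice_processor(void *arg, const char *message)
{
    FILE       *logfile = (FILE *) arg;    /* the pass-through pointer */

    fprintf(logfile, "server says: %s", message);
}

/* ... once the connection has been established ... */
static void
install_notice_processor(PGconn *conn, FILE *logfile)
{
    PQsetNoticeProcessor(conn, my_notice_processor, logfile);
}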
Once you have set a notice receiver or processor, you should expect
that that function could be called as long as either the
PGconn
object or PGresult
objects made
from it exist. At creation of a PGresult
, the
PGconn
's current notice handling pointers are copied
into the PGresult
for possible use by functions like
PQgetvalue
.
libpq's event system is designed to notify
registered event handlers about interesting
libpq events, such as the creation or
destruction of PGconn
and
PGresult
objects. A principal use case is that
this allows applications to associate their own data with a
PGconn
or PGresult
and ensure that that data is freed at an appropriate time.
Each registered event handler is associated with two pieces of data,
known to libpq only as opaque void *
pointers. There is a pass-through pointer that is provided
by the application when the event handler is registered with a
PGconn
. The pass-through pointer never changes for the
life of the PGconn
and all PGresult
s
generated from it; so if used, it must point to long-lived data.
In addition there is an instance data pointer, which starts
out NULL
in every PGconn
and PGresult
.
This pointer can be manipulated using the
PQinstanceData
,
PQsetInstanceData
,
PQresultInstanceData
and
PQresultSetInstanceData
functions. Note that
unlike the pass-through pointer, instance data of a PGconn
is not automatically inherited by PGresult
s created from
it. libpq does not know what pass-through
and instance data pointers point to (if anything) and will never attempt
to free them — that is the responsibility of the event handler.
The enum PGEventId
names the types of events handled by
the event system. All its values have names beginning with
PGEVT
. For each event type, there is a corresponding
event info structure that carries the parameters passed to the event
handlers. The event types are:
PGEVT_REGISTER
The register event occurs when PQregisterEventProc
is called. It is the ideal time to initialize any
instanceData
an event procedure may need. Only one
register event will be fired per event handler per connection. If the
event procedure fails, the registration is aborted.
typedef struct { PGconn *conn; } PGEventRegister;
When a PGEVT_REGISTER
event is received, the
evtInfo
pointer should be cast to a
PGEventRegister *
. This structure contains a
PGconn
that should be in the
CONNECTION_OK
status; guaranteed if one calls
PQregisterEventProc
right after obtaining a good
PGconn
. When returning a failure code, all
cleanup must be performed as no PGEVT_CONNDESTROY
event will be sent.
PGEVT_CONNRESET
The connection reset event is fired on completion of
PQreset
or PQresetPoll
. In
both cases, the event is only fired if the reset was successful. If
the event procedure fails, the entire connection reset will fail; the
PGconn
is put into
CONNECTION_BAD
status and
PQresetPoll
will return
PGRES_POLLING_FAILED
.
typedef struct { PGconn *conn; } PGEventConnReset;
When a PGEVT_CONNRESET
event is received, the
evtInfo
pointer should be cast to a
PGEventConnReset *
. Although the contained
PGconn
was just reset, all event data remains
unchanged. This event should be used to reset/reload/requery any
associated instanceData
. Note that even if the
event procedure fails to process PGEVT_CONNRESET
, it will
still receive a PGEVT_CONNDESTROY
event when the connection
is closed.
PGEVT_CONNDESTROY
The connection destroy event is fired in response to
PQfinish
. It is the event procedure's
responsibility to properly clean up its event data as libpq has no
ability to manage this memory. Failure to clean up will lead
to memory leaks.
typedef struct { PGconn *conn; } PGEventConnDestroy;
When a PGEVT_CONNDESTROY
event is received, the
evtInfo
pointer should be cast to a
PGEventConnDestroy *
. This event is fired
prior to PQfinish
performing any other cleanup.
The return value of the event procedure is ignored since there is no
way of indicating a failure from PQfinish
. Also,
an event procedure failure should not abort the process of cleaning up
unwanted memory.
PGEVT_RESULTCREATE
The result creation event is fired in response to any query execution
function that generates a result, including
PQgetResult
. This event will only be fired after
the result has been created successfully.
typedef struct { PGconn *conn; PGresult *result; } PGEventResultCreate;
When a PGEVT_RESULTCREATE
event is received, the
evtInfo
pointer should be cast to a
PGEventResultCreate *
. The
conn
is the connection used to generate the
result. This is the ideal place to initialize any
instanceData
that needs to be associated with the
result. If the event procedure fails, the result will be cleared and
the failure will be propagated. The event procedure must not try to
PQclear
the result object for itself. When returning a
failure code, all cleanup must be performed as no
PGEVT_RESULTDESTROY
event will be sent.
PGEVT_RESULTCOPY
The result copy event is fired in response to
PQcopyResult
. This event will only be fired after
the copy is complete. Only event procedures that have
successfully handled the PGEVT_RESULTCREATE
or PGEVT_RESULTCOPY
event for the source result
will receive PGEVT_RESULTCOPY
events.
typedef struct { const PGresult *src; PGresult *dest; } PGEventResultCopy;
When a PGEVT_RESULTCOPY
event is received, the
evtInfo
pointer should be cast to a
PGEventResultCopy *
. The
src
result is what was copied while the
dest
result is the copy destination. This event
can be used to provide a deep copy of instanceData
,
since PQcopyResult
cannot do that. If the event
procedure fails, the entire copy operation will fail and the
dest
result will be cleared. When returning a
failure code, all cleanup must be performed as no
PGEVT_RESULTDESTROY
event will be sent for the
destination result.
PGEVT_RESULTDESTROY
The result destroy event is fired in response to a
PQclear
. It is the event procedure's
responsibility to properly clean up its event data as libpq has no
ability to manage this memory. Failure to clean up will lead
to memory leaks.
typedef struct { PGresult *result; } PGEventResultDestroy;
When a PGEVT_RESULTDESTROY
event is received, the
evtInfo
pointer should be cast to a
PGEventResultDestroy *
. This event is fired
prior to PQclear
performing any other cleanup.
The return value of the event procedure is ignored since there is no
way of indicating a failure from PQclear
. Also,
an event procedure failure should not abort the process of cleaning up
unwanted memory.
PGEventProc
PGEventProc
is a typedef for a pointer to an
event procedure, that is, the user callback function that receives
events from libpq. The signature of an event procedure must be
int eventproc(PGEventId evtId, void *evtInfo, void *passThrough)
The evtId
parameter indicates which
PGEVT
event occurred. The
evtInfo
pointer must be cast to the appropriate
structure type to obtain further information about the event.
The passThrough
parameter is the pointer
provided to PQregisterEventProc
when the event
procedure was registered. The function should return a non-zero value
if it succeeds and zero if it fails.
A particular event procedure can be registered only once in any
PGconn
. This is because the address of the procedure
is used as a lookup key to identify the associated instance data.
On Windows, functions can have two different addresses: one visible
from outside a DLL and another visible from inside the DLL. One
should be careful that only one of these addresses is used with
libpq's event-procedure functions, else confusion will
result. The simplest rule for writing code that will work is to
ensure that event procedures are declared static
. If the
procedure's address must be available outside its own source file,
expose a separate function to return the address.
PQregisterEventProc
Registers an event callback procedure with libpq.
int PQregisterEventProc(PGconn *conn, PGEventProc proc, const char *name, void *passThrough);
An event procedure must be registered once on each
PGconn
you want to receive events about. There is no
limit, other than memory, on the number of event procedures that
can be registered with a connection. The function returns a non-zero
value if it succeeds and zero if it fails.
The proc
argument will be called when a libpq
event is fired. Its memory address is also used to lookup
instanceData
. The name
argument is used to refer to the event procedure in error messages.
This value cannot be NULL
or a zero-length string. The name string is
copied into the PGconn
, so what is passed need not be
long-lived. The passThrough
pointer is passed
to the proc
whenever an event occurs. This
argument can be NULL
.
PQsetInstanceData
Sets the connection conn
's instanceData
for procedure proc
to data
. This
returns non-zero for success and zero for failure. (Failure is
only possible if proc
has not been properly
registered in conn
.)
int PQsetInstanceData(PGconn *conn, PGEventProc proc, void *data);
PQinstanceData
Returns the
connection conn
's instanceData
associated with procedure proc
,
or NULL
if there is none.
void *PQinstanceData(const PGconn *conn, PGEventProc proc);
PQresultSetInstanceData
Sets the result's instanceData
for proc
to data
. This returns
non-zero for success and zero for failure. (Failure is only
possible if proc
has not been properly registered
in the result.)
int PQresultSetInstanceData(PGresult *res, PGEventProc proc, void *data);
Beware that any storage represented by data
will not be accounted for by PQresultMemorySize
,
unless it is allocated using PQresultAlloc
.
(Doing so is recommendable because it eliminates the need to free
such storage explicitly when the result is destroyed.)
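As a hedged illustration of that recommendation (a sketch, not part of the official examples), a PGEVT_RESULTCREATE branch could obtain its per-result storage from the result itself via PQresultAlloc; the helper name attach_result_data and the mydata type are illustrative only:
/* Sketch: allocate instanceData from the result's own storage so it is
 * counted by PQresultMemorySize and released automatically when the
 * result is destroyed (no explicit cleanup needed for it). */
#include <string.h>
#include <libpq-events.h>

typedef struct { int n; char *str; } mydata;    /* illustrative */

static int
attach_result_data(PGEventResultCreate *e, PGEventProc proc)
{
    mydata *res_data = PQresultAlloc(e->result, sizeof(mydata));

    if (res_data == NULL)
        return 0;               /* let the event procedure report failure */
    memset(res_data, 0, sizeof(mydata));
    return PQresultSetInstanceData(e->result, proc, res_data);
}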
PQresultInstanceData
Returns the result's instanceData
associated with proc
, or NULL
if there is none.
void *PQresultInstanceData(const PGresult *res, PGEventProc proc);
Here is a skeleton example of managing private data associated with libpq connections and results.
/* required header for libpq events (note: includes libpq-fe.h) */
#include <libpq-events.h>

#include <stdio.h>
#include <string.h>

/* The instanceData */
typedef struct
{
    int         n;
    char       *str;
} mydata;

/* application-provided helpers (not part of libpq) */
extern mydata *get_mydata(PGconn *conn);
extern mydata *dup_mydata(const mydata *data);
extern void free_mydata(mydata *data);

/* PGEventProc */
static int myEventProc(PGEventId evtId, void *evtInfo, void *passThrough);

int
main(void)
{
    mydata     *data;
    PGresult   *res;
    PGresult   *res_copy;
    PGconn     *conn =
        PQconnectdb("dbname=postgres options=-csearch_path=");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        /* PQerrorMessage's result includes a trailing newline */
        fprintf(stderr, "%s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /*
     * called once on any connection that should receive events.
     * Sends a PGEVT_REGISTER to myEventProc.
     */
    if (!PQregisterEventProc(conn, myEventProc, "mydata_proc", NULL))
    {
        fprintf(stderr, "Cannot register PGEventProc\n");
        PQfinish(conn);
        return 1;
    }

    /* conn instanceData is available */
    data = PQinstanceData(conn, myEventProc);

    /* Sends a PGEVT_RESULTCREATE to myEventProc */
    res = PQexec(conn, "SELECT 1 + 1");

    /* result instanceData is available */
    data = PQresultInstanceData(res, myEventProc);

    /* If PG_COPYRES_EVENTS is used, sends a PGEVT_RESULTCOPY to myEventProc */
    res_copy = PQcopyResult(res, PG_COPYRES_TUPLES | PG_COPYRES_EVENTS);

    /*
     * result instanceData is available if PG_COPYRES_EVENTS was
     * used during the PQcopyResult call.
     */
    data = PQresultInstanceData(res_copy, myEventProc);

    /* Both clears send a PGEVT_RESULTDESTROY to myEventProc */
    PQclear(res);
    PQclear(res_copy);

    /* Sends a PGEVT_CONNDESTROY to myEventProc */
    PQfinish(conn);

    return 0;
}

static int
myEventProc(PGEventId evtId, void *evtInfo, void *passThrough)
{
    switch (evtId)
    {
        case PGEVT_REGISTER:
            {
                PGEventRegister *e = (PGEventRegister *) evtInfo;
                mydata     *data = get_mydata(e->conn);

                /* associate app specific data with connection */
                PQsetInstanceData(e->conn, myEventProc, data);
                break;
            }

        case PGEVT_CONNRESET:
            {
                PGEventConnReset *e = (PGEventConnReset *) evtInfo;
                mydata     *data = PQinstanceData(e->conn, myEventProc);

                if (data)
                    memset(data, 0, sizeof(mydata));
                break;
            }

        case PGEVT_CONNDESTROY:
            {
                PGEventConnDestroy *e = (PGEventConnDestroy *) evtInfo;
                mydata     *data = PQinstanceData(e->conn, myEventProc);

                /* free instance data because the conn is being destroyed */
                if (data)
                    free_mydata(data);
                break;
            }

        case PGEVT_RESULTCREATE:
            {
                PGEventResultCreate *e = (PGEventResultCreate *) evtInfo;
                mydata     *conn_data = PQinstanceData(e->conn, myEventProc);
                mydata     *res_data = dup_mydata(conn_data);

                /* associate app specific data with result (copy it from conn) */
                PQresultSetInstanceData(e->result, myEventProc, res_data);
                break;
            }

        case PGEVT_RESULTCOPY:
            {
                PGEventResultCopy *e = (PGEventResultCopy *) evtInfo;
                mydata     *src_data = PQresultInstanceData(e->src, myEventProc);
                mydata     *dest_data = dup_mydata(src_data);

                /* associate app specific data with result (copy it from a result) */
                PQresultSetInstanceData(e->dest, myEventProc, dest_data);
                break;
            }

        case PGEVT_RESULTDESTROY:
            {
                PGEventResultDestroy *e = (PGEventResultDestroy *) evtInfo;
                mydata     *data = PQresultInstanceData(e->result, myEventProc);

                /* free instance data because the result is being destroyed */
                if (data)
                    free_mydata(data);
                break;
            }

        /* unknown event ID, just return success */
        default:
            break;
    }

    return 1;                   /* event processing succeeded */
}
The following environment variables can be used to select default
connection parameter values, which will be used by
PQconnectdb
, PQsetdbLogin
and
PQsetdb
if no value is directly specified by the calling
code. These are useful to avoid hard-coding database connection
information into simple client applications, for example.
PGHOST
behaves the same as the host connection parameter.
PGHOSTADDR
behaves the same as the hostaddr connection parameter.
This can be set instead of or in addition to PGHOST
to avoid DNS lookup overhead.
PGPORT
behaves the same as the port connection parameter.
PGDATABASE
behaves the same as the dbname connection parameter.
PGUSER
behaves the same as the user connection parameter.
PGPASSWORD
behaves the same as the password connection parameter.
Use of this environment variable
is not recommended for security reasons, as some operating systems
allow non-root users to see process environment variables via
ps; instead consider using a password file
(see Section 34.16).
PGPASSFILE
behaves the same as the passfile connection parameter.
PGCHANNELBINDING
behaves the same as the channel_binding connection parameter.
PGSERVICE
behaves the same as the service connection parameter.
PGSERVICEFILE
specifies the name of the per-user
connection service file
(see Section 34.17).
Defaults to ~/.pg_service.conf
, or
%APPDATA%\postgresql\.pg_service.conf
on
Microsoft Windows.
PGOPTIONS
behaves the same as the options connection parameter.
PGAPPNAME
behaves the same as the application_name connection parameter.
PGSSLMODE
behaves the same as the sslmode connection parameter.
PGREQUIRESSL
behaves the same as the requiressl connection parameter.
This environment variable is deprecated in favor of the
PGSSLMODE
variable; setting both variables suppresses the
effect of this one.
PGSSLCOMPRESSION
behaves the same as the sslcompression connection parameter.
PGSSLCERT
behaves the same as the sslcert connection parameter.
PGSSLKEY
behaves the same as the sslkey connection parameter.
PGSSLROOTCERT
behaves the same as the sslrootcert connection parameter.
PGSSLCRL
behaves the same as the sslcrl connection parameter.
PGSSLCRLDIR
behaves the same as the sslcrldir connection parameter.
PGSSLSNI
behaves the same as the sslsni connection parameter.
PGREQUIREPEER
behaves the same as the requirepeer connection parameter.
PGSSLMINPROTOCOLVERSION
behaves the same as the ssl_min_protocol_version connection parameter.
PGSSLMAXPROTOCOLVERSION
behaves the same as the ssl_max_protocol_version connection parameter.
PGGSSENCMODE
behaves the same as the gssencmode connection parameter.
PGKRBSRVNAME
behaves the same as the krbsrvname connection parameter.
PGGSSLIB
behaves the same as the gsslib connection parameter.
PGCONNECT_TIMEOUT
behaves the same as the connect_timeout connection parameter.
PGCLIENTENCODING
behaves the same as the client_encoding connection parameter.
PGTARGETSESSIONATTRS
behaves the same as the target_session_attrs connection parameter.
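As a hedged illustration of how these variables are picked up (not an official example; the variable values are up to the environment), a client can pass an empty conninfo string and let libpq fill in everything from the environment:
/* Sketch: with, e.g., PGHOST, PGPORT, PGDATABASE and PGUSER set in the
 * environment, an empty conninfo string connects using those defaults. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("");     /* all parameters from PG* variables */

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "%s", PQerrorMessage(conn));
    PQfinish(conn);
    return 0;
}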
The following environment variables can be used to specify default behavior for each PostgreSQL session. (See also the ALTER ROLE and ALTER DATABASE commands for ways to set default behavior on a per-user or per-database basis.)
Refer to the SQL command SET for information on correct values for these environment variables.
The following environment variables determine internal behavior of libpq; they override compiled-in defaults.
The file .pgpass
in a user's home directory can
contain passwords to
be used if the connection requires a password (and no password has been
specified otherwise). On Microsoft Windows the file is named
%APPDATA%\postgresql\pgpass.conf
(where
%APPDATA%
refers to the Application Data subdirectory in
the user's profile).
Alternatively, the password file to use can be specified
using the connection parameter passfile
or the environment variable PGPASSFILE
.
This file should contain lines of the following format:
hostname
:port
:database
:username
:password
(You can add a reminder comment to the file by copying the line above and
preceding it with #
.)
Each of the first four fields can be a literal value, or
*
, which matches anything. The password field from
the first line that matches the current connection parameters will be
used. (Therefore, put more-specific entries first when you are using
wildcards.) If an entry needs to contain :
or
\
, escape this character with \
.
The host name field is matched to the host
connection
parameter if that is specified, otherwise to
the hostaddr
parameter if that is specified; if neither
are given then the host name localhost
is searched for.
The host name localhost
is also searched for when
the connection is a Unix-domain socket connection and
the host
parameter
matches libpq's default socket directory path.
In a standby server, a database field of replication
matches streaming replication connections made to the primary server.
The database field is of limited usefulness otherwise, because users have
the same password for all databases in the same cluster.
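For illustration only, a hypothetical ~/.pgpass could look like this (the more specific entry is listed first so that it wins over the wildcard line; the host, user, and password values are made up):
# hostname:port:database:username:password
db.example.com:5432:salesdb:reportuser:s3cretpa55
*:*:*:appuser:another-secret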
On Unix systems, the permissions on a password file must
disallow any access to world or group; achieve this by a command such as
chmod 0600 ~/.pgpass
. If the permissions are less
strict than this, the file will be ignored. On Microsoft Windows, it
is assumed that the file is stored in a directory that is secure, so
no special permissions check is made.
The connection service file allows libpq connection parameters to be
associated with a single service name. That service name can then be
specified in a libpq connection string, and the associated settings will be
used. This allows connection parameters to be modified without requiring
a recompile of the libpq-using application. The service name can also be
specified using the PGSERVICE
environment variable.
Service names can be defined in either a per-user service file or a
system-wide file. If the same service name exists in both the user
and the system file, the user file takes precedence.
By default, the per-user service file is named
~/.pg_service.conf
.
On Microsoft Windows, it is named
%APPDATA%\postgresql\.pg_service.conf
(where
%APPDATA%
refers to the Application Data subdirectory
in the user's profile). A different file name can be specified by
setting the environment variable PGSERVICEFILE
.
The system-wide file is named pg_service.conf
.
By default it is sought in the etc
directory
of the PostgreSQL installation
(use pg_config --sysconfdir
to identify this
directory precisely). Another directory, but not a different file
name, can be specified by setting the environment variable
PGSYSCONFDIR
.
Either service file uses an “INI file” format where the section name is the service name and the parameters are connection parameters; see Section 34.1.2 for a list. For example:
# comment
[mydb]
host=somehost
port=5433
user=admin
An example file is provided in
the PostgreSQL installation at
share/pg_service.conf.sample
.
Connection parameters obtained from a service file are combined with
parameters obtained from other sources. A service file setting
overrides the corresponding environment variable, and in turn can be
overridden by a value given directly in the connection string.
For example, using the above service file, a connection string
service=mydb port=5434
will use
host somehost
, port 5434
,
user admin
, and other parameters as set by
environment variables or built-in defaults.
If libpq has been compiled with LDAP support (option
--with-ldap
for configure)
it is possible to retrieve connection options like host
or dbname
via LDAP from a central server.
The advantage is that if the connection parameters for a database change,
the connection information doesn't have to be updated on all client machines.
LDAP connection parameter lookup uses the connection service file
pg_service.conf
(see Section 34.17). A line in a
pg_service.conf
stanza that starts with
ldap://
will be recognized as an LDAP URL and an
LDAP query will be performed. The result must be a list of
keyword = value
pairs which will be used to set
connection options. The URL must conform to
RFC 1959
and be of the form
ldap://[hostname[:port]]/search_base?attribute?search_scope?filter
where hostname
defaults to
localhost
and port
defaults to 389.
Processing of pg_service.conf
is terminated after
a successful LDAP lookup, but is continued if the LDAP server cannot
be contacted. This is to provide a fallback with further LDAP URL
lines that point to different LDAP servers, classical keyword
= value
pairs, or default connection options. If you would
rather get an error message in this case, add a syntactically incorrect
line after the LDAP URL.
A sample LDAP entry that has been created with the LDIF file
version:1
dn:cn=mydatabase,dc=mycompany,dc=com
changetype:add
objectclass:top
objectclass:device
cn:mydatabase
description:host=dbserver.mycompany.com
description:port=5439
description:dbname=mydb
description:user=mydb_user
description:sslmode=require
might be queried with the following LDAP URL:
ldap://ldap.mycompany.com/dc=mycompany,dc=com?description?one?(cn=mydatabase)
You can also mix regular service file entries with LDAP lookups.
A complete example for a stanza in pg_service.conf
would be:
# only host and port are stored in LDAP, specify dbname and user explicitly
[customerdb]
dbname=customer
user=appuser
ldap://ldap.acme.com/cn=dbserver,cn=hosts?pgconnectinfo?base?(objectclass=*)
PostgreSQL has native support for using SSL connections to encrypt client/server communications for increased security. See Section 19.9 for details about the server-side SSL functionality.
libpq reads the system-wide
OpenSSL configuration file. By default, this
file is named openssl.cnf
and is located in the
directory reported by openssl version -d
. This default
can be overridden by setting environment variable
OPENSSL_CONF
to the name of the desired configuration
file.
By default, PostgreSQL will not perform any verification of the server certificate. This means that it is possible to spoof the server identity (for example by modifying a DNS record or by taking over the server IP address) without the client knowing. In order to prevent spoofing, the client must be able to verify the server's identity via a chain of trust. A chain of trust is established by placing a root (self-signed) certificate authority (CA) certificate on one computer and a leaf certificate signed by the root certificate on another computer. It is also possible to use an “intermediate” certificate which is signed by the root certificate and signs leaf certificates.
To allow the client to verify the identity of the server, place a root certificate on the client and a leaf certificate signed by the root certificate on the server. To allow the server to verify the identity of the client, place a root certificate on the server and a leaf certificate signed by the root certificate on the client. One or more intermediate certificates (usually stored with the leaf certificate) can also be used to link the leaf certificate to the root certificate.
Once a chain of trust has been established, there are two ways for
the client to validate the leaf certificate sent by the server.
If the parameter sslmode
is set to verify-ca
,
libpq will verify that the server is trustworthy by checking the
certificate chain up to the root certificate stored on the client.
If sslmode
is set to verify-full
,
libpq will also verify that the server host
name matches the name stored in the server certificate. The
SSL connection will fail if the server certificate cannot be
verified. verify-full
is recommended in most
security-sensitive environments.
In verify-full
mode, the host name is matched against the
certificate's Subject Alternative Name attribute(s), or against the
Common Name attribute if no Subject Alternative Name of type dNSName
is
present. If the certificate's name attribute starts with an asterisk
(*
), the asterisk will be treated as
a wildcard, which will match all characters except a dot
(.
). This means the certificate will not match subdomains.
If the connection is made using an IP address instead of a host name, the
IP address will be matched (without doing any DNS lookups).
To allow server certificate verification, one or more root certificates
must be placed in the file ~/.postgresql/root.crt
in the user's home directory. (On Microsoft Windows the file is named
%APPDATA%\postgresql\root.crt
.) Intermediate
certificates should also be added to the file if they are needed to link
the certificate chain sent by the server to the root certificates
stored on the client.
Certificate Revocation List (CRL) entries are also checked
if the file ~/.postgresql/root.crl
exists
(%APPDATA%\postgresql\root.crl
on Microsoft
Windows).
The location of the root certificate file and the CRL can be changed by
setting
the connection parameters sslrootcert
and sslcrl
or the environment variables PGSSLROOTCERT
and PGSSLCRL
.
sslcrldir
or the environment variable PGSSLCRLDIR
can also be used to specify a directory containing CRL files.
For backwards compatibility with earlier versions of PostgreSQL, if a
root CA file exists, the behavior of
sslmode
=require
will be the same
as that of verify-ca
, meaning the server certificate
is validated against the CA. Relying on this behavior is discouraged,
and applications that need certificate validation should always use
verify-ca
or verify-full
.
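As a hedged sketch of these settings (the host name and certificate path are hypothetical; by default libpq looks in ~/.postgresql/root.crt), a client insisting on full verification could connect like this:
/* Sketch: require that the server certificate chains to a trusted root
 * and that its name matches the host we asked for (verify-full). */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb(
        "host=db.example.com dbname=mydb "
        "sslmode=verify-full sslrootcert=/etc/myapp/root.crt");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "%s", PQerrorMessage(conn));
    PQfinish(conn);
    return 0;
}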
If the server attempts to verify the identity of the
client by requesting the client's leaf certificate,
libpq will send the certificate(s) stored in
file ~/.postgresql/postgresql.crt
in the user's home
directory. The certificates must chain to the root certificate trusted
by the server. A matching
private key file ~/.postgresql/postgresql.key
must also
be present.
On Microsoft Windows these files are named
%APPDATA%\postgresql\postgresql.crt
and
%APPDATA%\postgresql\postgresql.key
.
The location of the certificate and key files can be overridden by the
connection parameters sslcert
and sslkey
, or by the
environment variables PGSSLCERT
and PGSSLKEY
.
On Unix systems, the permissions on the private key file must disallow
any access to world or group; achieve this by a command such as
chmod 0600 ~/.postgresql/postgresql.key
.
Alternatively, the file can be owned by root and have group read access
(that is, 0640
permissions). That setup is intended
for installations where certificate and key files are managed by the
operating system. The user of libpq should
then be made a member of the group that has access to those certificate
and key files. (On Microsoft Windows, there is no file permissions
check, since the %APPDATA%\postgresql
directory is
presumed secure.)
The first certificate in postgresql.crt
must be the
client's certificate because it must match the client's private key.
“Intermediate” certificates can be optionally appended
to the file — doing so avoids requiring storage of intermediate
certificates on the server (ssl_ca_file).
The certificate and key may be in PEM or ASN.1 DER format.
The key may be
stored in cleartext or encrypted with a passphrase using any algorithm
supported by OpenSSL, like AES-128. If the key
is stored encrypted, then the passphrase may be provided in the
sslpassword connection option. If an
encrypted key is supplied and the sslpassword
option
is absent or blank, a password will be prompted for interactively by
OpenSSL with a
Enter PEM pass phrase:
prompt if a TTY is available.
Applications can override the client certificate prompt and the handling
of the sslpassword
parameter by supplying their own
key password callback; see
PQsetSSLKeyPassHook_OpenSSL
.
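The following is a minimal sketch of such a callback, assuming a libpq version that provides PQsetSSLKeyPassHook_OpenSSL and PQdefaultSSLKeyPassHook_OpenSSL (PostgreSQL 13 or later); the environment variable name is purely hypothetical, and a real application should obtain the passphrase more securely:
/* Sketch: supply the client key passphrase from an application-defined
 * source instead of the sslpassword parameter or an OpenSSL prompt. */
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

static int
my_key_pass_hook(char *buf, int size, PGconn *conn)
{
    const char *pw = getenv("MYAPP_SSLKEY_PASSPHRASE");   /* hypothetical */

    if (pw == NULL)
        /* fall back to libpq's normal handling (sslpassword / prompt) */
        return PQdefaultSSLKeyPassHook_OpenSSL(buf, size, conn);

    strncpy(buf, pw, size);
    buf[size - 1] = '\0';
    return (int) strlen(buf);   /* length of passphrase; 0 means failure */
}

void
install_key_pass_hook(void)
{
    PQsetSSLKeyPassHook_OpenSSL(my_key_pass_hook);
    /* subsequent connections consult the hook when the key is encrypted */
}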
For instructions on creating certificates, see Section 19.9.5.
The different values for the sslmode
parameter provide different
levels of protection. SSL can provide
protection against three types of attacks:
If a third party can examine the network traffic between the client and the server, it can read both connection information (including the user name and password) and the data that is passed. SSL uses encryption to prevent this.
If a third party can modify the data while passing between the client and server, it can pretend to be the server and therefore see and modify data even if it is encrypted. The third party can then forward the connection information and data to the original server, making it impossible to detect this attack. Common vectors to do this include DNS poisoning and address hijacking, whereby the client is directed to a different server than intended. There are also several other attack methods that can accomplish this. SSL uses certificate verification to prevent this, by authenticating the server to the client.
If a third party can pretend to be an authorized client, it can simply access data it should not have access to. Typically this can happen through insecure password management. SSL uses client certificates to prevent this, by making sure that only holders of valid certificates can access the server.
For a connection to be known SSL-secured, SSL usage must be configured
on both the client and the server before the connection
is made. If it is only configured on the server, the client may end up
sending sensitive information (e.g., passwords) before
it knows that the server requires high security. In libpq, secure
connections can be ensured
by setting the sslmode
parameter to verify-full
or
verify-ca
, and providing the system with a root certificate to
verify against. This is analogous to using an https
URL for encrypted web browsing.
Once the server has been authenticated, the client can pass sensitive data. This means that up until this point, the client does not need to know if certificates will be used for authentication, making it safe to specify that only in the server configuration.
All SSL options carry overhead in the form of encryption and
key-exchange, so there is a trade-off that has to be made between performance
and security. Table 34.1
illustrates the risks the different sslmode
values
protect against, and what statement they make about security and overhead.
Table 34.1. SSL Mode Descriptions
sslmode | Eavesdropping protection | MITM protection | Statement
---|---|---|---
disable | No | No | I don't care about security, and I don't want to pay the overhead of encryption.
allow | Maybe | No | I don't care about security, but I will pay the overhead of encryption if the server insists on it.
prefer | Maybe | No | I don't care about encryption, but I wish to pay the overhead of encryption if the server supports it.
require | Yes | No | I want my data to be encrypted, and I accept the overhead. I trust that the network will make sure I always connect to the server I want.
verify-ca | Yes | Depends on CA policy | I want my data encrypted, and I accept the overhead. I want to be sure that I connect to a server that I trust.
verify-full | Yes | Yes | I want my data encrypted, and I accept the overhead. I want to be sure that I connect to a server I trust, and that it's the one I specify.
The difference between verify-ca
and verify-full
depends on the policy of the root CA. If a public
CA is used, verify-ca
allows connections to a server
that somebody else may have registered with the CA.
In this case, verify-full
should always be used. If
a local CA is used, or even a self-signed certificate, using
verify-ca
often provides enough protection.
The default value for sslmode
is prefer
. As is shown
in the table, this makes no sense from a security point of view, and it only
promises performance overhead if possible. It is only provided as the default
for backward compatibility, and is not recommended in secure deployments.
Table 34.2 summarizes the files that are relevant to the SSL setup on the client.
Table 34.2. Libpq/Client SSL File Usage
File | Contents | Effect
---|---|---
~/.postgresql/postgresql.crt | client certificate | sent to server
~/.postgresql/postgresql.key | client private key | proves client certificate sent by owner; does not indicate certificate owner is trustworthy
~/.postgresql/root.crt | trusted certificate authorities | checks that server certificate is signed by a trusted certificate authority
~/.postgresql/root.crl | certificates revoked by certificate authorities | server certificate must not be on this list
If your application initializes libssl
and/or
libcrypto
libraries and libpq
is built with SSL support, you should call
PQinitOpenSSL
to tell libpq
that the libssl
and/or libcrypto
libraries
have been initialized by your application, so that
libpq will not also initialize those libraries.
However, this is unnecessary when using OpenSSL
version 1.1.0 or later, as duplicate initializations are no longer problematic.
PQinitOpenSSL
Allows applications to select which security libraries to initialize.
void PQinitOpenSSL(int do_ssl, int do_crypto);
When do_ssl
is non-zero, libpq
will initialize the OpenSSL library before first
opening a database connection. When do_crypto
is
non-zero, the libcrypto
library will be initialized. By
default (if PQinitOpenSSL
is not called), both libraries
are initialized. When SSL support is not compiled in, this function is
present but does nothing.
If your application uses and initializes either OpenSSL
or its underlying libcrypto
library, you must
call this function with zeroes for the appropriate parameter(s)
before first opening a database connection. Also be sure that you
have done that initialization before opening a database connection.
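For example (a hedged sketch; only relevant when linking against OpenSSL older than 1.1.0), an application that initializes libssl itself but leaves libcrypto to libpq could call:
/* Sketch: the application initializes libssl on its own, so tell libpq
 * not to do it again; libcrypto is still initialized by libpq. */
#include <libpq-fe.h>

void
prepare_ssl_for_libpq(void)
{
    PQinitOpenSSL(0, 1);        /* do_ssl = 0, do_crypto = 1 */
    /* ... open database connections afterwards ... */
}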
PQinitSSL
Allows applications to select which security libraries to initialize.
void PQinitSSL(int do_ssl);
This function is equivalent to
PQinitOpenSSL(do_ssl, do_ssl)
.
It is sufficient for applications that initialize both or neither
of OpenSSL and libcrypto
.
PQinitSSL
has been present since
PostgreSQL 8.0, while PQinitOpenSSL
was added in PostgreSQL 8.4, so PQinitSSL
might be preferable for applications that need to work with older
versions of libpq.
libpq is reentrant and thread-safe by default.
You might need to use special compiler command-line
options when you compile your application code. Refer to your
system's documentation for information about how to build
thread-enabled applications, or look in
src/Makefile.global
for PTHREAD_CFLAGS
and PTHREAD_LIBS
. The function PQisthreadsafe
allows the querying of libpq's thread-safe status:
int PQisthreadsafe();
It returns 1 if libpq is thread-safe and 0 if it is not.
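For instance, a cautious multithreaded application might verify this at startup (a trivial sketch):
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

void
require_threadsafe_libpq(void)
{
    if (!PQisthreadsafe())
    {
        fprintf(stderr, "this libpq build is not thread-safe\n");
        exit(1);
    }
}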
One thread restriction is that no two threads attempt to manipulate
the same PGconn
object at the same time. In particular,
you cannot issue concurrent commands from different threads through
the same connection object. (If you need to run concurrent commands,
use multiple connections.)
PGresult
objects are normally read-only after creation,
and so can be passed around freely between threads. However, if you use
any of the PGresult
-modifying functions described in
Section 34.12 or Section 34.14, it's up
to you to avoid concurrent operations on the same PGresult
,
too.
The deprecated functions PQrequestCancel
and
PQoidStatus
are not thread-safe and should not be
used in multithread programs. PQrequestCancel
can be replaced by PQcancel
.
PQoidStatus
can be replaced by
PQoidValue
.
If you are using Kerberos inside your application (in addition to inside
libpq), you will need to do locking around
Kerberos calls because Kerberos functions are not thread-safe. See
function PQregisterThreadLock
in the
libpq source code for a way to do cooperative
locking between libpq and your application.
To build (i.e., compile and link) a program using libpq you need to do all of the following things:
Include the libpq-fe.h
header file:
#include <libpq-fe.h>
If you failed to do that then you will normally get error messages from your compiler similar to:
foo.c: In function `main':
foo.c:34: `PGconn' undeclared (first use in this function)
foo.c:35: `PGresult' undeclared (first use in this function)
foo.c:54: `CONNECTION_BAD' undeclared (first use in this function)
foo.c:68: `PGRES_COMMAND_OK' undeclared (first use in this function)
foo.c:95: `PGRES_TUPLES_OK' undeclared (first use in this function)
Point your compiler to the directory where the PostgreSQL header
files were installed, by supplying the
-Idirectory
option
to your compiler. (In some cases the compiler will look into
the directory in question by default, so you can omit this
option.) For instance, your compile command line could look
like:
cc -c -I/usr/local/pgsql/include testprog.c
If you are using makefiles then add the option to the
CPPFLAGS
variable:
CPPFLAGS += -I/usr/local/pgsql/include
If there is any chance that your program might be compiled by
other users then you should not hardcode the directory location
like that. Instead, you can run the utility
pg_config
to find out where the header
files are on the local system:
$ pg_config --includedir
/usr/local/include
If you
have pkg-config
installed, you can run instead:
$ pkg-config --cflags libpq
-I/usr/local/include
Note that this will already include the -I
in front of
the path.
Failure to specify the correct option to the compiler will result in an error message such as:
testlibpq.c:8:22: libpq-fe.h: No such file or directory
When linking the final program, specify the option
-lpq
so that the libpq
library gets pulled in, as well as the option
-L
to point
the compiler to the directory where the
libpq library resides. (Again, the
compiler will search some directories by default.) For maximum
portability, put the -Ldirectory option before the
-lpq
option. For example:
cc -o testprog testprog1.o testprog2.o -L/usr/local/pgsql/lib -lpq
You can find out the library directory using
pg_config
as well:
$ pg_config --libdir
/usr/local/pgsql/lib
Or again use pkg-config
:
$ pkg-config --libs libpq
-L/usr/local/pgsql/lib -lpq
Note again that this prints the full options, not only the path.
Error messages that point to problems in this area could look like the following:
testlibpq.o: In function `main':
testlibpq.o(.text+0x60): undefined reference to `PQsetdbLogin'
testlibpq.o(.text+0x71): undefined reference to `PQstatus'
testlibpq.o(.text+0xa4): undefined reference to `PQerrorMessage'
This means you forgot -lpq
.
/usr/bin/ld: cannot find -lpq
This means you forgot the -L
option or did not
specify the right directory.
These examples and others can be found in the
directory src/test/examples
in the source code
distribution.
Example 34.1. libpq Example Program 1
/* * src/test/examples/testlibpq.c * * * testlibpq.c * * Test the C version of libpq, the PostgreSQL frontend library. */ #include <stdio.h> #include <stdlib.h> #include "libpq-fe.h" static void exit_nicely(PGconn *conn) { PQfinish(conn); exit(1); } int main(int argc, char **argv) { const char *conninfo; PGconn *conn; PGresult *res; int nFields; int i, j; /* * If the user supplies a parameter on the command line, use it as the * conninfo string; otherwise default to setting dbname=postgres and using * environment variables or defaults for all other connection parameters. */ if (argc > 1) conninfo = argv[1]; else conninfo = "dbname = postgres"; /* Make a connection to the database */ conn = PQconnectdb(conninfo); /* Check to see that the backend connection was successfully made */ if (PQstatus(conn) != CONNECTION_OK) { fprintf(stderr, "%s", PQerrorMessage(conn)); exit_nicely(conn); } /* Set always-secure search path, so malicious users can't take control. */ res = PQexec(conn, "SELECT pg_catalog.set_config('search_path', '', false)"); if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "SET failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } /* * Should PQclear PGresult whenever it is no longer needed to avoid memory * leaks */ PQclear(res); /* * Our test case here involves using a cursor, for which we must be inside * a transaction block. We could do the whole thing with a single * PQexec() of "select * from pg_database", but that's too trivial to make * a good example. */ /* Start a transaction block */ res = PQexec(conn, "BEGIN"); if (PQresultStatus(res) != PGRES_COMMAND_OK) { fprintf(stderr, "BEGIN command failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } PQclear(res); /* * Fetch rows from pg_database, the system catalog of databases */ res = PQexec(conn, "DECLARE myportal CURSOR FOR select * from pg_database"); if (PQresultStatus(res) != PGRES_COMMAND_OK) { fprintf(stderr, "DECLARE CURSOR failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } PQclear(res); res = PQexec(conn, "FETCH ALL in myportal"); if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "FETCH ALL failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } /* first, print out the attribute names */ nFields = PQnfields(res); for (i = 0; i < nFields; i++) printf("%-15s", PQfname(res, i)); printf("\n\n"); /* next, print out the rows */ for (i = 0; i < PQntuples(res); i++) { for (j = 0; j < nFields; j++) printf("%-15s", PQgetvalue(res, i, j)); printf("\n"); } PQclear(res); /* close the portal ... we don't bother to check for errors ... */ res = PQexec(conn, "CLOSE myportal"); PQclear(res); /* end the transaction */ res = PQexec(conn, "END"); PQclear(res); /* close the connection to the database and cleanup */ PQfinish(conn); return 0; }
Example 34.2. libpq Example Program 2
/* * src/test/examples/testlibpq2.c * * * testlibpq2.c * Test of the asynchronous notification interface * * Start this program, then from psql in another window do * NOTIFY TBL2; * Repeat four times to get this program to exit. * * Or, if you want to get fancy, try this: * populate a database with the following commands * (provided in src/test/examples/testlibpq2.sql): * * CREATE SCHEMA TESTLIBPQ2; * SET search_path = TESTLIBPQ2; * CREATE TABLE TBL1 (i int4); * CREATE TABLE TBL2 (i int4); * CREATE RULE r1 AS ON INSERT TO TBL1 DO * (INSERT INTO TBL2 VALUES (new.i); NOTIFY TBL2); * * Start this program, then from psql do this four times: * * INSERT INTO TESTLIBPQ2.TBL1 VALUES (10); */ #ifdef WIN32 #include <windows.h> #endif #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <sys/time.h> #include <sys/types.h> #ifdef HAVE_SYS_SELECT_H #include <sys/select.h> #endif #include "libpq-fe.h" static void exit_nicely(PGconn *conn) { PQfinish(conn); exit(1); } int main(int argc, char **argv) { const char *conninfo; PGconn *conn; PGresult *res; PGnotify *notify; int nnotifies; /* * If the user supplies a parameter on the command line, use it as the * conninfo string; otherwise default to setting dbname=postgres and using * environment variables or defaults for all other connection parameters. */ if (argc > 1) conninfo = argv[1]; else conninfo = "dbname = postgres"; /* Make a connection to the database */ conn = PQconnectdb(conninfo); /* Check to see that the backend connection was successfully made */ if (PQstatus(conn) != CONNECTION_OK) { fprintf(stderr, "%s", PQerrorMessage(conn)); exit_nicely(conn); } /* Set always-secure search path, so malicious users can't take control. */ res = PQexec(conn, "SELECT pg_catalog.set_config('search_path', '', false)"); if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "SET failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } /* * Should PQclear PGresult whenever it is no longer needed to avoid memory * leaks */ PQclear(res); /* * Issue LISTEN command to enable notifications from the rule's NOTIFY. */ res = PQexec(conn, "LISTEN TBL2"); if (PQresultStatus(res) != PGRES_COMMAND_OK) { fprintf(stderr, "LISTEN command failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } PQclear(res); /* Quit after four notifies are received. */ nnotifies = 0; while (nnotifies < 4) { /* * Sleep until something happens on the connection. We use select(2) * to wait for input, but you could also use poll() or similar * facilities. */ int sock; fd_set input_mask; sock = PQsocket(conn); if (sock < 0) break; /* shouldn't happen */ FD_ZERO(&input_mask); FD_SET(sock, &input_mask); if (select(sock + 1, &input_mask, NULL, NULL, NULL) < 0) { fprintf(stderr, "select() failed: %s\n", strerror(errno)); exit_nicely(conn); } /* Now check for input */ PQconsumeInput(conn); while ((notify = PQnotifies(conn)) != NULL) { fprintf(stderr, "ASYNC NOTIFY of '%s' received from backend PID %d\n", notify->relname, notify->be_pid); PQfreemem(notify); nnotifies++; PQconsumeInput(conn); } } fprintf(stderr, "Done.\n"); /* close the connection to the database and cleanup */ PQfinish(conn); return 0; }
Example 34.3. libpq Example Program 3
/* * src/test/examples/testlibpq3.c * * * testlibpq3.c * Test out-of-line parameters and binary I/O. * * Before running this, populate a database with the following commands * (provided in src/test/examples/testlibpq3.sql): * * CREATE SCHEMA testlibpq3; * SET search_path = testlibpq3; * SET standard_conforming_strings = ON; * CREATE TABLE test1 (i int4, t text, b bytea); * INSERT INTO test1 values (1, 'joe''s place', '\000\001\002\003\004'); * INSERT INTO test1 values (2, 'ho there', '\004\003\002\001\000'); * * The expected output is: * * tuple 0: got * i = (4 bytes) 1 * t = (11 bytes) 'joe's place' * b = (5 bytes) \000\001\002\003\004 * * tuple 0: got * i = (4 bytes) 2 * t = (8 bytes) 'ho there' * b = (5 bytes) \004\003\002\001\000 */ #ifdef WIN32 #include <windows.h> #endif #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <string.h> #include <sys/types.h> #include "libpq-fe.h" /* for ntohl/htonl */ #include <netinet/in.h> #include <arpa/inet.h> static void exit_nicely(PGconn *conn) { PQfinish(conn); exit(1); } /* * This function prints a query result that is a binary-format fetch from * a table defined as in the comment above. We split it out because the * main() function uses it twice. */ static void show_binary_results(PGresult *res) { int i, j; int i_fnum, t_fnum, b_fnum; /* Use PQfnumber to avoid assumptions about field order in result */ i_fnum = PQfnumber(res, "i"); t_fnum = PQfnumber(res, "t"); b_fnum = PQfnumber(res, "b"); for (i = 0; i < PQntuples(res); i++) { char *iptr; char *tptr; char *bptr; int blen; int ival; /* Get the field values (we ignore possibility they are null!) */ iptr = PQgetvalue(res, i, i_fnum); tptr = PQgetvalue(res, i, t_fnum); bptr = PQgetvalue(res, i, b_fnum); /* * The binary representation of INT4 is in network byte order, which * we'd better coerce to the local byte order. */ ival = ntohl(*((uint32_t *) iptr)); /* * The binary representation of TEXT is, well, text, and since libpq * was nice enough to append a zero byte to it, it'll work just fine * as a C string. * * The binary representation of BYTEA is a bunch of bytes, which could * include embedded nulls so we have to pay attention to field length. */ blen = PQgetlength(res, i, b_fnum); printf("tuple %d: got\n", i); printf(" i = (%d bytes) %d\n", PQgetlength(res, i, i_fnum), ival); printf(" t = (%d bytes) '%s'\n", PQgetlength(res, i, t_fnum), tptr); printf(" b = (%d bytes) ", blen); for (j = 0; j < blen; j++) printf("\\%03o", bptr[j]); printf("\n\n"); } } int main(int argc, char **argv) { const char *conninfo; PGconn *conn; PGresult *res; const char *paramValues[1]; int paramLengths[1]; int paramFormats[1]; uint32_t binaryIntVal; /* * If the user supplies a parameter on the command line, use it as the * conninfo string; otherwise default to setting dbname=postgres and using * environment variables or defaults for all other connection parameters. */ if (argc > 1) conninfo = argv[1]; else conninfo = "dbname = postgres"; /* Make a connection to the database */ conn = PQconnectdb(conninfo); /* Check to see that the backend connection was successfully made */ if (PQstatus(conn) != CONNECTION_OK) { fprintf(stderr, "%s", PQerrorMessage(conn)); exit_nicely(conn); } /* Set always-secure search path, so malicious users can't take control. 
*/ res = PQexec(conn, "SET search_path = testlibpq3"); if (PQresultStatus(res) != PGRES_COMMAND_OK) { fprintf(stderr, "SET failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } PQclear(res); /* * The point of this program is to illustrate use of PQexecParams() with * out-of-line parameters, as well as binary transmission of data. * * This first example transmits the parameters as text, but receives the * results in binary format. By using out-of-line parameters we can avoid * a lot of tedious mucking about with quoting and escaping, even though * the data is text. Notice how we don't have to do anything special with * the quote mark in the parameter value. */ /* Here is our out-of-line parameter value */ paramValues[0] = "joe's place"; res = PQexecParams(conn, "SELECT * FROM test1 WHERE t = $1", 1, /* one param */ NULL, /* let the backend deduce param type */ paramValues, NULL, /* don't need param lengths since text */ NULL, /* default to all text params */ 1); /* ask for binary results */ if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } show_binary_results(res); PQclear(res); /* * In this second example we transmit an integer parameter in binary form, * and again retrieve the results in binary form. * * Although we tell PQexecParams we are letting the backend deduce * parameter type, we really force the decision by casting the parameter * symbol in the query text. This is a good safety measure when sending * binary parameters. */ /* Convert integer value "2" to network byte order */ binaryIntVal = htonl((uint32_t) 2); /* Set up parameter arrays for PQexecParams */ paramValues[0] = (char *) &binaryIntVal; paramLengths[0] = sizeof(binaryIntVal); paramFormats[0] = 1; /* binary */ res = PQexecParams(conn, "SELECT * FROM test1 WHERE i = $1::int4", 1, /* one param */ NULL, /* let the backend deduce param type */ paramValues, paramLengths, paramFormats, 1); /* ask for binary results */ if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "SELECT failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } show_binary_results(res); PQclear(res); /* close the connection to the database and cleanup */ PQfinish(conn); return 0; }
[15] The client will block trying to send queries to the server, but the server will block trying to send results to the client from queries it has already processed. This only occurs when the client sends enough queries to fill both its output buffer and the server's receive buffer before it switches to processing input from the server, but it's hard to predict exactly when that will happen.
PostgreSQL has a large object facility, which provides stream-style access to user data that is stored in a special large-object structure. Streaming access is useful when working with data values that are too large to manipulate conveniently as a whole.
This chapter describes the implementation and the programming and query language interfaces to PostgreSQL large object data. We use the libpq C library for the examples in this chapter, but most programming interfaces native to PostgreSQL support equivalent functionality. Other interfaces might use the large object interface internally to provide generic support for large values. This is not described here.
All large objects are stored in a single system table named pg_largeobject
.
Each large object also has an entry in the system table pg_largeobject_metadata
.
Large objects can be created, modified, and deleted using a read/write API
that is similar to standard operations on files.
PostgreSQL also supports a storage system called “TOAST”, which automatically stores values larger than a single database page into a secondary storage area per table. This makes the large object facility partially obsolete. One remaining advantage of the large object facility is that it allows values up to 4 TB in size, whereas TOASTed fields can be at most 1 GB. Also, reading and updating portions of a large object can be done efficiently, while most operations on a TOASTed field will read or write the whole value as a unit.
The large object implementation breaks large objects up into “chunks” and stores the chunks in rows in the database. A B-tree index guarantees fast searches for the correct chunk number when doing random access reads and writes.
The chunks stored for a large object do not have to be contiguous. For example, if an application opens a new large object, seeks to offset 1000000, and writes a few bytes there, this does not result in allocation of 1000000 bytes worth of storage; only of chunks covering the range of data bytes actually written. A read operation will, however, read out zeroes for any unallocated locations preceding the last existing chunk. This corresponds to the common behavior of “sparsely allocated” files in Unix file systems.
As of PostgreSQL 9.0, large objects have an owner
and a set of access permissions, which can be managed using
GRANT and
REVOKE.
SELECT
privileges are required to read a large
object, and
UPDATE
privileges are required to write or
truncate it.
Only the large object's owner (or a database superuser) can delete,
comment on, or change the owner of a large object.
To adjust this behavior for compatibility with prior releases, see the
lo_compat_privileges run-time parameter.
This section describes the facilities that
PostgreSQL's libpq
client interface library provides for accessing large objects.
The PostgreSQL large object interface is
modeled after the Unix file-system interface, with
analogues of open
, read
,
write
,
lseek
, etc.
All large object manipulation using these functions must take place within an SQL transaction block, since large object file descriptors are only valid for the duration of a transaction.
If an error occurs while executing any one of these functions, the
function will return an otherwise-impossible value, typically 0 or -1.
A message describing the error is stored in the connection object and
can be retrieved with PQerrorMessage
.
Client applications that use these functions should include the header file
libpq/libpq-fs.h
and link with the
libpq library.
Client applications cannot use these functions while a libpq connection is in pipeline mode.
Oid lo_creat(PGconn *conn, int mode);
creates a new large object.
The return value is the OID that was assigned to the new large object,
or InvalidOid
(zero) on failure.
mode
is unused and
ignored as of PostgreSQL 8.1; however, for
backward compatibility with earlier releases it is best to
set it to INV_READ
, INV_WRITE
,
or INV_READ
|
INV_WRITE
.
(These symbolic constants are defined
in the header file libpq/libpq-fs.h
.)
An example:
inv_oid = lo_creat(conn, INV_READ|INV_WRITE);
Oid lo_create(PGconn *conn, Oid lobjId);
also creates a new large object. The OID to be assigned can be
specified by lobjId
;
if so, failure occurs if that OID is already in use for some large
object. If lobjId
is InvalidOid
(zero) then lo_create
assigns an unused
OID (this is the same behavior as lo_creat
).
The return value is the OID that was assigned to the new large object,
or InvalidOid
(zero) on failure.
lo_create
is new as of PostgreSQL
8.1; if this function is run against an older server version, it will
fail and return InvalidOid
.
An example:
inv_oid = lo_create(conn, desired_oid);
To import an operating system file as a large object, call
Oid lo_import(PGconn *conn, const char *filename);
filename
specifies the operating system name of
the file to be imported as a large object.
The return value is the OID that was assigned to the new large object,
or InvalidOid
(zero) on failure.
Note that the file is read by the client interface library, not by
the server; so it must exist in the client file system and be readable
by the client application.
Oid lo_import_with_oid(PGconn *conn, const char *filename, Oid lobjId);
also imports a new large object. The OID to be assigned can be
specified by lobjId
;
if so, failure occurs if that OID is already in use for some large
object. If lobjId
is InvalidOid
(zero) then lo_import_with_oid
assigns an unused
OID (this is the same behavior as lo_import
).
The return value is the OID that was assigned to the new large object,
or InvalidOid
(zero) on failure.
lo_import_with_oid
is new as of PostgreSQL
8.4 and uses lo_create
internally which is new in 8.1; if this function is run against 8.0 or before, it will
fail and return InvalidOid
.
To export a large object into an operating system file, call
int lo_export(PGconn *conn, Oid lobjId, const char *filename);
The lobjId
argument specifies the OID of the large
object to export and the filename
argument
specifies the operating system name of the file. Note that the file is
written by the client interface library, not by the server. Returns 1
on success, -1 on failure.
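A hedged sketch combining the two calls (the file paths are hypothetical; the explicit transaction block follows the rule stated above):
/* Sketch: import a client-side file as a large object, then export the
 * same object to another client-side file. */
#include <stdio.h>
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>

static void
copy_file_via_large_object(PGconn *conn)
{
    Oid         lobj_oid;
    PGresult   *res;

    res = PQexec(conn, "BEGIN");
    PQclear(res);

    lobj_oid = lo_import(conn, "/tmp/original.dat");
    if (lobj_oid == InvalidOid)
        fprintf(stderr, "lo_import failed: %s", PQerrorMessage(conn));
    else if (lo_export(conn, lobj_oid, "/tmp/copy.dat") != 1)
        fprintf(stderr, "lo_export failed: %s", PQerrorMessage(conn));

    res = PQexec(conn, "COMMIT");
    PQclear(res);
}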
To open an existing large object for reading or writing, call
int lo_open(PGconn *conn, Oid lobjId, int mode);
The lobjId
argument specifies the OID of the large
object to open. The mode
bits control whether the
object is opened for reading (INV_READ
), writing
(INV_WRITE
), or both.
(These symbolic constants are defined
in the header file libpq/libpq-fs.h
.)
lo_open
returns a (non-negative) large object
descriptor for later use in lo_read
,
lo_write
, lo_lseek
,
lo_lseek64
, lo_tell
,
lo_tell64
, lo_truncate
,
lo_truncate64
, and lo_close
.
The descriptor is only valid for
the duration of the current transaction.
On failure, -1 is returned.
The server currently does not distinguish between modes
INV_WRITE
and INV_READ
|
INV_WRITE
: you are allowed to read from the descriptor
in either case. However there is a significant difference between
these modes and INV_READ
alone: with INV_READ
you cannot write on the descriptor, and the data read from it will
reflect the contents of the large object at the time of the transaction
snapshot that was active when lo_open
was executed,
regardless of later writes by this or other transactions. Reading
from a descriptor opened with INV_WRITE
returns
data that reflects all writes of other committed transactions as well
as writes of the current transaction. This is similar to the behavior
of REPEATABLE READ
versus READ COMMITTED
transaction
modes for ordinary SQL SELECT
commands.
lo_open
will fail if SELECT
privilege is not available for the large object, or
if INV_WRITE
is specified and UPDATE
privilege is not available.
(Prior to PostgreSQL 11, these privilege
checks were instead performed at the first actual read or write call
using the descriptor.)
These privilege checks can be disabled with the
lo_compat_privileges run-time parameter.
An example:
inv_fd = lo_open(conn, inv_oid, INV_READ|INV_WRITE);
int lo_write(PGconn *conn, int fd, const char *buf, size_t len);
writes len
bytes from buf
(which must be of size len
) to large object
descriptor fd
. The fd
argument must
have been returned by a previous lo_open
. The
number of bytes actually written is returned (in the current
implementation, this will always equal len
unless
there is an error). In the event of an error, the return value is -1.
Although the len
parameter is declared as
size_t
, this function will reject length values larger than
INT_MAX
. In practice, it's best to transfer data in chunks
of at most a few megabytes anyway.
int lo_read(PGconn *conn, int fd, char *buf, size_t len);
reads up to len
bytes from large object descriptor
fd
into buf
(which must be
of size len
). The fd
argument must have been returned by a previous
lo_open
. The number of bytes actually read is
returned; this will be less than len
if the end of
the large object is reached first. In the event of an error, the return
value is -1.
Although the len
parameter is declared as
size_t
, this function will reject length values larger than
INT_MAX
. In practice, it's best to transfer data in chunks
of at most a few megabytes anyway.
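A corresponding chunked read loop might look like this (again only a sketch; the helper name and buffer size are illustrative):

#include "libpq-fe.h"

/* Hypothetical helper: read the large object open on descriptor "fd" from
 * its current position to the end, processing it in fixed-size chunks.
 * Returns the total number of bytes read, or -1 on error. */
static long long
read_in_chunks(PGconn *conn, int fd)
{
    static char buf[1024 * 1024];
    long long   total = 0;
    int         n;

    while ((n = lo_read(conn, fd, buf, sizeof(buf))) > 0)
    {
        /* ... process the n bytes now in buf ... */
        total += n;
    }
    return (n < 0) ? -1 : total;
}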
To change the current read or write location associated with a large object descriptor, call
int lo_lseek(PGconn *conn, int fd, int offset, int whence);
This function moves the
current location pointer for the large object descriptor identified by
fd
to the new location specified by
offset
. The valid values for whence
are SEEK_SET
(seek from object start),
SEEK_CUR
(seek from current position), and
SEEK_END
(seek from object end). The return value is
the new location pointer, or -1 on error.
When dealing with large objects that might exceed 2GB in size, instead use
pg_int64 lo_lseek64(PGconn *conn, int fd, pg_int64 offset, int whence);
This function has the same behavior
as lo_lseek
, but it can accept an
offset
larger than 2GB and/or deliver a result larger
than 2GB.
Note that lo_lseek
will fail if the new location
pointer would be greater than 2GB.
lo_lseek64
is new as of PostgreSQL
9.3. If this function is run against an older server version, it will
fail and return -1.
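For example, one common use is to determine the size of a large object by seeking to its end and using the returned position (a sketch; the helper name is illustrative, not part of libpq):

#include <stdio.h>          /* for SEEK_END */
#include "libpq-fe.h"

/* Hypothetical helper: return the size of the large object open on "fd",
 * or -1 on error. The read/write position is left at the end of the object. */
static pg_int64
large_object_size(PGconn *conn, int fd)
{
    return lo_lseek64(conn, fd, 0, SEEK_END);
}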
To obtain the current read or write location of a large object descriptor, call
int lo_tell(PGconn *conn, int fd);
If there is an error, the return value is -1.
When dealing with large objects that might exceed 2GB in size, instead use
pg_int64 lo_tell64(PGconn *conn, int fd);
This function has the same behavior
as lo_tell
, but it can deliver a result larger
than 2GB.
Note that lo_tell
will fail if the current
read/write location is greater than 2GB.
lo_tell64
is new as of PostgreSQL
9.3. If this function is run against an older server version, it will
fail and return -1.
To truncate a large object to a given length, call
int lo_truncate(PGconn *conn, int fd, size_t len);
This function truncates the large object
descriptor fd
to length len
. The
fd
argument must have been returned by a
previous lo_open
. If len
is
greater than the large object's current length, the large object
is extended to the specified length with null bytes ('\0').
On success, lo_truncate
returns
zero. On error, the return value is -1.
The read/write location associated with the descriptor
fd
is not changed.
Although the len
parameter is declared as
size_t
, lo_truncate
will reject length
values larger than INT_MAX
.
When dealing with large objects that might exceed 2GB in size, instead use
int lo_truncate64(PGconn *conn, int fd, pg_int64 len);
This function has the same
behavior as lo_truncate
, but it can accept a
len
value exceeding 2GB.
lo_truncate
is new as of PostgreSQL
8.3; if this function is run against an older server version, it will
fail and return -1.
lo_truncate64
is new as of PostgreSQL
9.3; if this function is run against an older server version, it will
fail and return -1.
A large object descriptor can be closed by calling
int lo_close(PGconn *conn, int fd);
where fd
is a
large object descriptor returned by lo_open
.
On success, lo_close
returns zero. On
error, the return value is -1.
Any large object descriptors that remain open at the end of a transaction will be closed automatically.
Server-side functions tailored for manipulating large objects from SQL are listed in Table 35.1.
Table 35.1. SQL-Oriented Large Object Functions
There are additional server-side functions corresponding to each of the
client-side functions described earlier; indeed, for the most part the
client-side functions are simply interfaces to the equivalent server-side
functions. The ones just as convenient to call via SQL commands are
lo_creat
,
lo_create
,
lo_unlink
,
lo_import
, and
lo_export
.
Here are examples of their use:
CREATE TABLE image ( name text, raster oid );

SELECT lo_creat(-1);       -- returns OID of new, empty large object

SELECT lo_create(43213);   -- attempts to create large object with OID 43213

SELECT lo_unlink(173454);  -- deletes large object with OID 173454

INSERT INTO image (name, raster) VALUES ('beautiful image', lo_import('/etc/motd'));

INSERT INTO image (name, raster)  -- same as above, but specify OID to use
    VALUES ('beautiful image', lo_import('/etc/motd', 68583));

SELECT lo_export(image.raster, '/tmp/motd') FROM image WHERE name = 'beautiful image';
The server-side lo_import
and
lo_export
functions behave considerably differently
from their client-side analogs. These two functions read and write files
in the server's file system, using the permissions of the database's
owning user. Therefore, by default their use is restricted to superusers.
In contrast, the client-side import and export functions read and write
files in the client's file system, using the permissions of the client
program. The client-side functions do not require any database
privileges, except the privilege to read or write the large object in
question.
It is possible to GRANT use of the
server-side lo_import
and lo_export
functions to non-superusers, but
careful consideration of the security implications is required. A
malicious user of such privileges could easily parlay them into becoming
superuser (for example by rewriting server configuration files), or could
attack the rest of the server's file system without bothering to obtain
database superuser privileges as such. Access to roles having
such privilege must therefore be guarded just as carefully as access to
superuser roles. Nonetheless, if use of
server-side lo_import
or lo_export
is needed for some routine task, it's
safer to use a role with such privileges than one with full superuser
privileges, as that helps to reduce the risk of damage from accidental
errors.
The functionality of lo_read
and
lo_write
is also available via server-side calls,
but the names of the server-side functions differ from the client side
interfaces in that they do not contain underscores. You must call
these functions as loread
and lowrite
.
Example 35.1 is a sample program which shows how the large object
interface
in libpq can be used. Parts of the program are
commented out but are left in the source for the reader's
benefit. This program can also be found in
src/test/examples/testlo.c
in the source distribution.
Example 35.1. Large Objects with libpq Example Program
/*----------------------------------------------------------------- * * testlo.c * test using large objects with libpq * * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * * * IDENTIFICATION * src/test/examples/testlo.c * *----------------------------------------------------------------- */ #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include "libpq-fe.h" #include "libpq/libpq-fs.h" #define BUFSIZE 1024 /* * importFile - * import file "in_filename" into database as large object "lobjOid" * */ static Oid importFile(PGconn *conn, char *filename) { Oid lobjId; int lobj_fd; char buf[BUFSIZE]; int nbytes, tmp; int fd; /* * open the file to be read in */ fd = open(filename, O_RDONLY, 0666); if (fd < 0) { /* error */ fprintf(stderr, "cannot open unix file\"%s\"\n", filename); } /* * create the large object */ lobjId = lo_creat(conn, INV_READ | INV_WRITE); if (lobjId == 0) fprintf(stderr, "cannot create large object"); lobj_fd = lo_open(conn, lobjId, INV_WRITE); /* * read in from the Unix file and write to the inversion file */ while ((nbytes = read(fd, buf, BUFSIZE)) > 0) { tmp = lo_write(conn, lobj_fd, buf, nbytes); if (tmp < nbytes) fprintf(stderr, "error while reading \"%s\"", filename); } close(fd); lo_close(conn, lobj_fd); return lobjId; } static void pickout(PGconn *conn, Oid lobjId, int start, int len) { int lobj_fd; char *buf; int nbytes; int nread; lobj_fd = lo_open(conn, lobjId, INV_READ); if (lobj_fd < 0) fprintf(stderr, "cannot open large object %u", lobjId); lo_lseek(conn, lobj_fd, start, SEEK_SET); buf = malloc(len + 1); nread = 0; while (len - nread > 0) { nbytes = lo_read(conn, lobj_fd, buf, len - nread); buf[nbytes] = '\0'; fprintf(stderr, ">>> %s", buf); nread += nbytes; if (nbytes <= 0) break; /* no more data? 
*/ } free(buf); fprintf(stderr, "\n"); lo_close(conn, lobj_fd); } static void overwrite(PGconn *conn, Oid lobjId, int start, int len) { int lobj_fd; char *buf; int nbytes; int nwritten; int i; lobj_fd = lo_open(conn, lobjId, INV_WRITE); if (lobj_fd < 0) fprintf(stderr, "cannot open large object %u", lobjId); lo_lseek(conn, lobj_fd, start, SEEK_SET); buf = malloc(len + 1); for (i = 0; i < len; i++) buf[i] = 'X'; buf[i] = '\0'; nwritten = 0; while (len - nwritten > 0) { nbytes = lo_write(conn, lobj_fd, buf + nwritten, len - nwritten); nwritten += nbytes; if (nbytes <= 0) { fprintf(stderr, "\nWRITE FAILED!\n"); break; } } free(buf); fprintf(stderr, "\n"); lo_close(conn, lobj_fd); } /* * exportFile - * export large object "lobjOid" to file "out_filename" * */ static void exportFile(PGconn *conn, Oid lobjId, char *filename) { int lobj_fd; char buf[BUFSIZE]; int nbytes, tmp; int fd; /* * open the large object */ lobj_fd = lo_open(conn, lobjId, INV_READ); if (lobj_fd < 0) fprintf(stderr, "cannot open large object %u", lobjId); /* * open the file to be written to */ fd = open(filename, O_CREAT | O_WRONLY | O_TRUNC, 0666); if (fd < 0) { /* error */ fprintf(stderr, "cannot open unix file\"%s\"", filename); } /* * read in from the inversion file and write to the Unix file */ while ((nbytes = lo_read(conn, lobj_fd, buf, BUFSIZE)) > 0) { tmp = write(fd, buf, nbytes); if (tmp < nbytes) { fprintf(stderr, "error while writing \"%s\"", filename); } } lo_close(conn, lobj_fd); close(fd); } static void exit_nicely(PGconn *conn) { PQfinish(conn); exit(1); } int main(int argc, char **argv) { char *in_filename, *out_filename; char *database; Oid lobjOid; PGconn *conn; PGresult *res; if (argc != 4) { fprintf(stderr, "Usage: %s database_name in_filename out_filename\n", argv[0]); exit(1); } database = argv[1]; in_filename = argv[2]; out_filename = argv[3]; /* * set up the connection */ conn = PQsetdb(NULL, NULL, NULL, NULL, database); /* check to see that the backend connection was successfully made */ if (PQstatus(conn) != CONNECTION_OK) { fprintf(stderr, "%s", PQerrorMessage(conn)); exit_nicely(conn); } /* Set always-secure search path, so malicious users can't take control. */ res = PQexec(conn, "SELECT pg_catalog.set_config('search_path', '', false)"); if (PQresultStatus(res) != PGRES_TUPLES_OK) { fprintf(stderr, "SET failed: %s", PQerrorMessage(conn)); PQclear(res); exit_nicely(conn); } PQclear(res); res = PQexec(conn, "begin"); PQclear(res); printf("importing file \"%s\" ...\n", in_filename); /* lobjOid = importFile(conn, in_filename); */ lobjOid = lo_import(conn, in_filename); if (lobjOid == 0) fprintf(stderr, "%s\n", PQerrorMessage(conn)); else { printf("\tas large object %u.\n", lobjOid); printf("picking out bytes 1000-2000 of the large object\n"); pickout(conn, lobjOid, 1000, 1000); printf("overwriting bytes 1000-2000 of the large object with X's\n"); overwrite(conn, lobjOid, 1000, 1000); printf("exporting large object to file \"%s\" ...\n", out_filename); /* exportFile(conn, lobjOid, out_filename); */ if (lo_export(conn, lobjOid, out_filename) < 0) fprintf(stderr, "%s\n", PQerrorMessage(conn)); } res = PQexec(conn, "end"); PQclear(res); PQfinish(conn); return 0; }
Table of Contents
This chapter describes the embedded SQL package
for PostgreSQL. It was written by
Linus Tolke (<linus@epact.se>
) and Michael Meskes
(<meskes@postgresql.org>
). Originally it was written to work with
C. It also works with C++, but
it does not recognize all C++ constructs yet.
This documentation is quite incomplete. But since this interface is standardized, additional information can be found in many resources about SQL.
An embedded SQL program consists of code written in an ordinary
programming language, in this case C, mixed with SQL commands in
specially marked sections. To build the program, the source code (*.pgc
)
is first passed through the embedded SQL preprocessor, which converts it
to an ordinary C program (*.c
), and afterwards it can be processed by a C
compiler. (For details about the compiling and linking see Section 36.10.)
Converted ECPG applications call functions in the libpq library
through the embedded SQL library (ecpglib), and communicate with
the PostgreSQL server using the normal frontend-backend protocol.
Embedded SQL has advantages over other methods for handling SQL commands from C code. First, it takes care of the tedious passing of information to and from variables in your C program. Second, the SQL code in the program is checked at build time for syntactical correctness. Third, embedded SQL in C is specified in the SQL standard and supported by many other SQL database systems. The PostgreSQL implementation is designed to match this standard as much as possible, and it is usually possible to port embedded SQL programs written for other SQL databases to PostgreSQL with relative ease.
As already stated, programs written for the embedded SQL interface are normal C programs with special code inserted to perform database-related actions. This special code always has the form:
EXEC SQL ...;
These statements syntactically take the place of a C statement. Depending on the particular statement, they can appear at the global level or within a function.
Embedded
SQL statements follow the case-sensitivity rules of
normal SQL code, and not those of C. Also they allow nested
C-style comments as per the SQL standard. The C part of the
program, however, follows the C standard of not accepting nested comments.
Embedded SQL statements likewise use SQL rules, not
C rules, for parsing quoted strings and identifiers.
(See Section 4.1.2.1 and
Section 4.1.1 respectively. Note that
ECPG assumes that standard_conforming_strings
is on
.)
Of course, the C part of the program follows C quoting rules.
The following sections explain all the embedded SQL statements.
This section describes how to open, close, and switch database connections.
One connects to a database using the following statement:
EXEC SQL CONNECT TO target [AS connection-name] [USER user-name];
The target
can be specified in the
following ways:
dbname[@hostname][:port]
tcp:postgresql://hostname[:port][/dbname][?options]
unix:postgresql://localhost[:port][/dbname][?options]
an SQL string literal containing one of the above forms
a reference to a character variable containing one of the above forms
DEFAULT
The connection target DEFAULT
initiates a connection
to the default database under the default user name. No separate
user name or connection name can be specified in that case.
If you specify the connection target directly (that is, not as a string
literal or variable reference), then the components of the target are
passed through normal SQL parsing; this means that, for example,
the hostname
must look like one or more SQL
identifiers separated by dots, and those identifiers will be
case-folded unless double-quoted. Values of
any options
must be SQL identifiers,
integers, or variable references. Of course, you can put nearly
anything into an SQL identifier by double-quoting it.
In practice, it is probably less error-prone to use a (single-quoted)
string literal or a variable reference than to write the connection
target directly.
There are also different ways to specify the user name:
username
username/password
username IDENTIFIED BY password
username USING password
As above, the parameters username
and
password
can be an SQL identifier, an
SQL string literal, or a reference to a character variable.
If the connection target includes any options, those consist of keyword=value specifications separated by ampersands (&).
The allowed key words are the same ones recognized
by libpq (see
Section 34.1.2). Spaces are ignored before
any keyword
or value
,
though not within or after one. Note that there is no way to
write &
within a value
.
Notice that when specifying a socket connection
(with the unix:
prefix), the host name must be
exactly localhost
. To select a non-default
socket directory, write the directory's pathname as the value of
a host
option in
the options
part of the target.
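For example, a connection through a socket in a non-default directory might be written like this (the database name, user name, connection name, and directory path are placeholders):

EXEC SQL CONNECT TO 'unix:postgresql://localhost/mydb?host=/var/run/postgresql' AS sockconn USER myuser;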
The connection-name
is used to handle
multiple connections in one program. It can be omitted if a
program uses only one connection. The most recently opened
connection becomes the current connection, which is used by default
when an SQL statement is to be executed (see later in this
chapter).
Here are some examples of CONNECT
statements:
EXEC SQL CONNECT TO mydb@sql.mydomain.com;

EXEC SQL CONNECT TO tcp:postgresql://sql.mydomain.com/mydb AS myconnection USER john;

EXEC SQL BEGIN DECLARE SECTION;
const char *target = "mydb@sql.mydomain.com";
const char *user = "john";
const char *passwd = "secret";
EXEC SQL END DECLARE SECTION;
 ...
EXEC SQL CONNECT TO :target USER :user USING :passwd;
/* or EXEC SQL CONNECT TO :target USER :user/:passwd; */
The last example makes use of the feature referred to above as character variable references. You will see in later sections how C variables can be used in SQL statements when you prefix them with a colon.
Be advised that the format of the connection target is not specified in the SQL standard. So if you want to develop portable applications, you might want to use something based on the last example above to encapsulate the connection target string somewhere.
If untrusted users have access to a database that has not adopted a
secure schema usage pattern,
begin each session by removing publicly-writable schemas
from search_path
. For example,
add options=-c search_path= to the options part of the connection target, or issue
EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
after connecting. This consideration is not specific to
ECPG; it applies to every interface for executing arbitrary SQL commands.
SQL statements in embedded SQL programs are by default executed on the current connection, that is, the most recently opened one. If an application needs to manage multiple connections, then there are three ways to handle this.
The first option is to explicitly choose a connection for each SQL statement, for example:
EXEC SQL AT connection-name
SELECT ...;
This option is particularly suitable if the application needs to use several connections in mixed order.
If your application uses multiple threads of execution, they cannot share a connection concurrently. You must either explicitly control access to the connection (using mutexes) or use a connection for each thread.
The second option is to execute a statement to switch the current connection. That statement is:
EXEC SQL SET CONNECTION connection-name
;
This option is particularly convenient if many statements are to be executed on the same connection.
Here is an example program managing multiple database connections:
#include <stdio.h> EXEC SQL BEGIN DECLARE SECTION; char dbname[1024]; EXEC SQL END DECLARE SECTION; int main() { EXEC SQL CONNECT TO testdb1 AS con1 USER testuser; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; EXEC SQL CONNECT TO testdb2 AS con2 USER testuser; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; EXEC SQL CONNECT TO testdb3 AS con3 USER testuser; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; /* This query would be executed in the last opened database "testdb3". */ EXEC SQL SELECT current_database() INTO :dbname; printf("current=%s (should be testdb3)\n", dbname); /* Using "AT" to run a query in "testdb2" */ EXEC SQL AT con2 SELECT current_database() INTO :dbname; printf("current=%s (should be testdb2)\n", dbname); /* Switch the current connection to "testdb1". */ EXEC SQL SET CONNECTION con1; EXEC SQL SELECT current_database() INTO :dbname; printf("current=%s (should be testdb1)\n", dbname); EXEC SQL DISCONNECT ALL; return 0; }
This example would produce this output:
current=testdb3 (should be testdb3) current=testdb2 (should be testdb2) current=testdb1 (should be testdb1)
The third option is to declare an SQL identifier linked to the connection, for example:
EXEC SQL AT connection-name DECLARE statement-name STATEMENT;
EXEC SQL PREPARE statement-name FROM :dyn-string;
Once you link an SQL identifier to a connection, you can execute dynamic SQL statements on that connection without an AT clause. Note that this option behaves like a preprocessor directive: the link is effective only within the file in which it is declared.
Here is an example program using this option:
#include <stdio.h> EXEC SQL BEGIN DECLARE SECTION; char dbname[128]; char *dyn_sql = "SELECT current_database()"; EXEC SQL END DECLARE SECTION; int main(){ EXEC SQL CONNECT TO postgres AS con1; EXEC SQL CONNECT TO testdb AS con2; EXEC SQL AT con1 DECLARE stmt STATEMENT; EXEC SQL PREPARE stmt FROM :dyn_sql; EXEC SQL EXECUTE stmt INTO :dbname; printf("%s\n", dbname); EXEC SQL DISCONNECT ALL; return 0; }
This example would produce this output, even if the default connection is testdb:
postgres
To close a connection, use the following statement:
EXEC SQL DISCONNECT [connection
];
The connection
can be specified
in the following ways:
connection-name
CURRENT
ALL
If no connection name is specified, the current connection is closed.
It is good style that an application always explicitly disconnect from every connection it opened.
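For example (con1 is a placeholder connection name):

EXEC SQL DISCONNECT;            /* closes the current connection */
EXEC SQL DISCONNECT con1;       /* closes the connection named con1 */
EXEC SQL DISCONNECT CURRENT;    /* also closes the current connection */
EXEC SQL DISCONNECT ALL;        /* closes every open connection */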
Any SQL command can be run from within an embedded SQL application. Below are some examples of how to do that.
Creating a table:
EXEC SQL CREATE TABLE foo (number integer, ascii char(16)); EXEC SQL CREATE UNIQUE INDEX num1 ON foo(number); EXEC SQL COMMIT;
Inserting rows:
EXEC SQL INSERT INTO foo (number, ascii) VALUES (9999, 'doodad'); EXEC SQL COMMIT;
Deleting rows:
EXEC SQL DELETE FROM foo WHERE number = 9999; EXEC SQL COMMIT;
Updates:
EXEC SQL UPDATE foo SET ascii = 'foobar' WHERE number = 9999; EXEC SQL COMMIT;
SELECT
statements that return a single result
row can also be executed using
EXEC SQL
directly. To handle result sets with
multiple rows, an application has to use a cursor;
see Section 36.3.2 below. (As a special case, an
application can fetch multiple rows at once into an array host
variable; see Section 36.4.4.3.1.)
Single-row select:
EXEC SQL SELECT foo INTO :FooBar FROM table1 WHERE ascii = 'doodad';
Also, a configuration parameter can be retrieved with the
SHOW
command:
EXEC SQL SHOW search_path INTO :var;
The tokens of the form :something are host variables, that is, they refer to variables in the C program. They are explained in Section 36.4.
To retrieve a result set holding multiple rows, an application has to declare a cursor and fetch each row from the cursor. The steps to use a cursor are the following: declare a cursor, open it, fetch a row from the cursor, repeat, and finally close it.
Select using cursors:
EXEC SQL DECLARE foo_bar CURSOR FOR
    SELECT number, ascii FROM foo
    ORDER BY ascii;
EXEC SQL OPEN foo_bar;
EXEC SQL FETCH foo_bar INTO :FooBar, :DooDad;
...
EXEC SQL CLOSE foo_bar;
EXEC SQL COMMIT;
For more details about declaring a cursor, see DECLARE; for more details about fetching rows from a cursor, see FETCH.
The ECPG DECLARE
command does not actually
cause a statement to be sent to the PostgreSQL backend. The
cursor is opened in the backend (using the
backend's DECLARE
command) at the point when
the OPEN
command is executed.
In the default mode, statements are committed only when
EXEC SQL COMMIT
is issued. The embedded SQL
interface also supports autocommit of transactions (similar to
psql's default behavior) via the -t
command-line option to ecpg
(see ecpg) or via the EXEC SQL SET AUTOCOMMIT TO
ON
statement. In autocommit mode, each command is
automatically committed unless it is inside an explicit transaction
block. This mode can be explicitly turned off using EXEC
SQL SET AUTOCOMMIT TO OFF
.
The following transaction management commands are available:
EXEC SQL COMMIT
Commit an in-progress transaction.
EXEC SQL ROLLBACK
Roll back an in-progress transaction.
EXEC SQL PREPARE TRANSACTION
transaction_id
Prepare the current transaction for two-phase commit.
EXEC SQL COMMIT PREPARED
transaction_id
Commit a transaction that is in prepared state.
EXEC SQL ROLLBACK PREPARED
transaction_id
Roll back a transaction that is in prepared state.
EXEC SQL SET AUTOCOMMIT TO ON
Enable autocommit mode.
EXEC SQL SET AUTOCOMMIT TO OFF
Disable autocommit mode. This is the default.
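As an illustration, a two-phase commit sequence might look like the following sketch (it assumes the server's max_prepared_transactions setting is nonzero; the accounts table and the transaction identifier are placeholders):

/* "accounts" is a placeholder table used only for this sketch */
EXEC SQL INSERT INTO accounts (id, balance) VALUES (1, 100);
EXEC SQL PREPARE TRANSACTION 'ecpg_demo_tx';
/* ... coordinate with the other participants here ... */
EXEC SQL COMMIT PREPARED 'ecpg_demo_tx';
/* or, to abandon the prepared transaction:
   EXEC SQL ROLLBACK PREPARED 'ecpg_demo_tx'; */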
When the values to be passed to an SQL statement are not known at compile time, or the same statement is going to be used many times, then prepared statements can be useful.
The statement is prepared using the
command PREPARE
. For the values that are not
known yet, use the
placeholder “?
”:
EXEC SQL PREPARE stmt1 FROM "SELECT oid, datname FROM pg_database WHERE oid = ?";
If a statement returns a single row, the application can
call EXECUTE
after
PREPARE
to execute the statement, supplying the
actual values for the placeholders with a USING
clause:
EXEC SQL EXECUTE stmt1 INTO :dboid, :dbname USING 1;
If a statement returns multiple rows, the application can use a
cursor declared based on the prepared statement. To bind input
parameters, the cursor must be opened with
a USING
clause:
EXEC SQL PREPARE stmt1 FROM "SELECT oid,datname FROM pg_database WHERE oid > ?"; EXEC SQL DECLARE foo_bar CURSOR FOR stmt1; /* when end of result set reached, break out of while loop */ EXEC SQL WHENEVER NOT FOUND DO BREAK; EXEC SQL OPEN foo_bar USING 100; ... while (1) { EXEC SQL FETCH NEXT FROM foo_bar INTO :dboid, :dbname; ... } EXEC SQL CLOSE foo_bar;
When you don't need the prepared statement anymore, you should deallocate it:
EXEC SQL DEALLOCATE PREPARE name
;
For more details about PREPARE
,
see PREPARE. Also
see Section 36.5 for more details about using
placeholders and input parameters.
In Section 36.3 you saw how you can execute SQL statements from an embedded SQL program. Some of those statements only used fixed values and did not provide a way to insert user-supplied values into statements or have the program process the values returned by the query. Those kinds of statements are not really useful in real applications. This section explains in detail how you can pass data between your C program and the embedded SQL statements using a simple mechanism called host variables. In an embedded SQL program we consider the SQL statements to be guests in the C program code which is the host language. Therefore the variables of the C program are called host variables.
Another way to exchange values between PostgreSQL backends and ECPG applications is the use of SQL descriptors, described in Section 36.7.
Passing data between the C program and the SQL statements is particularly simple in embedded SQL. Instead of having the program paste the data into the statement, which entails various complications, such as properly quoting the value, you can simply write the name of a C variable into the SQL statement, prefixed by a colon. For example:
EXEC SQL INSERT INTO sometable VALUES (:v1, 'foo', :v2);
This statement refers to two C variables named
v1
and v2
and also uses a
regular SQL string literal, to illustrate that you are not
restricted to use one kind of data or the other.
This style of inserting C variables in SQL statements works anywhere a value expression is expected in an SQL statement.
To pass data from the program to the database, for example as parameters in a query, or to pass data from the database back to the program, the C variables that are intended to contain this data need to be declared in specially marked sections, so the embedded SQL preprocessor is made aware of them.
This section starts with:
EXEC SQL BEGIN DECLARE SECTION;
and ends with:
EXEC SQL END DECLARE SECTION;
Between those lines, there must be normal C variable declarations, such as:
int x = 4; char foo[16], bar[16];
As you can see, you can optionally assign an initial value to the variable. The variable's scope is determined by the location of its declaring section within the program. You can also declare variables with the following syntax which implicitly creates a declare section:
EXEC SQL int i = 4;
You can have as many declare sections in a program as you like.
The declarations are also echoed to the output file as normal C variables, so there's no need to declare them again. Variables that are not intended to be used in SQL commands can be declared normally outside these special sections.
The definition of a structure or union also must be listed inside
a DECLARE
section. Otherwise the preprocessor cannot
handle these types since it does not know the definition.
Now you should be able to pass data generated by your program into
an SQL command. But how do you retrieve the results of a query?
For that purpose, embedded SQL provides special variants of the
usual commands SELECT
and
FETCH
. These commands have a special
INTO
clause that specifies which host variables
the retrieved values are to be stored in.
SELECT
is used for a query that returns only a
single row, and FETCH
is used for a query that
returns multiple rows, using a cursor.
Here is an example:
/*
 * assume this table:
 * CREATE TABLE test1 (a int, b varchar(50));
 */

EXEC SQL BEGIN DECLARE SECTION;
int v1;
VARCHAR v2;
EXEC SQL END DECLARE SECTION;

 ...

EXEC SQL SELECT a, b INTO :v1, :v2 FROM test1;
So the INTO
clause appears between the select
list and the FROM
clause. The number of
elements in the select list and the list after
INTO
(also called the target list) must be
equal.
Here is an example using the command FETCH
:
EXEC SQL BEGIN DECLARE SECTION; int v1; VARCHAR v2; EXEC SQL END DECLARE SECTION; ... EXEC SQL DECLARE foo CURSOR FOR SELECT a, b FROM test; ... do { ... EXEC SQL FETCH NEXT FROM foo INTO :v1, :v2; ... } while (...);
Here the INTO
clause appears after all the
normal clauses.
When ECPG applications exchange values between the PostgreSQL server and the C application, such as when retrieving query results from the server or executing SQL statements with input parameters, the values need to be converted between PostgreSQL data types and host language variable types (C language data types, concretely). One of the main points of ECPG is that it takes care of this automatically in most cases.
In this respect, there are two kinds of data types: Some simple
PostgreSQL data types, such as integer
and text
, can be read and written by the application
directly. Other PostgreSQL data types, such
as timestamp
and numeric
can only be
accessed through special library functions; see
Section 36.4.4.2.
Table 36.1 shows which PostgreSQL data types correspond to which C data types. When you wish to send or receive a value of a given PostgreSQL data type, you should declare a C variable of the corresponding C data type in the declare section.
Table 36.1. Mapping Between PostgreSQL Data Types and C Variable Types
PostgreSQL data type | Host variable type
---|---
smallint | short
integer | int
bigint | long long int
decimal | decimal [a]
numeric | numeric [a]
real | float
double precision | double
smallserial | short
serial | int
bigserial | long long int
oid | unsigned int
character(n), varchar(n), text | char[n+1], VARCHAR[n+1]
name | char[NAMEDATALEN]
timestamp | timestamp [a]
interval | interval [a]
date | date [a]
boolean | bool [b]
bytea | char *, bytea[n]

[a] This type can only be accessed through special library functions; see Section 36.4.4.2.
[b] declared in ecpglib.h if not native
To handle SQL character string data types, such
as varchar
and text
, there are two
possible ways to declare the host variables.
One way is using char[]
, an array
of char
, which is the most common way to handle
character data in C.
EXEC SQL BEGIN DECLARE SECTION; char str[50]; EXEC SQL END DECLARE SECTION;
Note that you have to take care of the length yourself. If you use this host variable as the target variable of a query which returns a string with more than 49 characters, a buffer overflow occurs.
The other way is using the VARCHAR
type, which is a
special type provided by ECPG. The definition of an array of
type VARCHAR
is converted into a
named struct
for every variable. A declaration like:
VARCHAR var[180];
is converted into:
struct varchar_var { int len; char arr[180]; } var;
The member arr
hosts the string
including a terminating zero byte. Thus, to store a string in
a VARCHAR
host variable, the host variable has to be
declared with the length including the zero byte terminator. The
member len
holds the length of the
string stored in the arr
without the
terminating zero byte. When a host variable is used as input for
a query, if strlen(arr)
and len
are different, the shorter one
is used.
VARCHAR
can be written in upper or lower case, but
not in mixed case.
char
and VARCHAR
host variables can
also hold values of other SQL types, which will be stored in
their string forms.
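For example, a numeric value can be fetched into a char[] host variable and arrives in its text form (a minimal sketch; the constant in the query is only for illustration):

EXEC SQL BEGIN DECLARE SECTION;
char price_text[64];
EXEC SQL END DECLARE SECTION;
...
EXEC SQL SELECT 12.345::numeric(6,3) INTO :price_text;
/* price_text now contains the string form of the value, e.g. "12.345" */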
ECPG contains some special types that help you to interact easily
with some special data types from the PostgreSQL server. In
particular, it has implemented support for the
numeric
, decimal
, date
, timestamp
,
and interval
types. These data types cannot usefully be
mapped to primitive host variable types (such
as int
, long long int
,
or char[]
), because they have a complex internal
structure. Applications deal with these types by declaring host
variables in special types and accessing them using functions in
the pgtypes library. The pgtypes library, described in detail
in Section 36.6 contains basic functions to deal
with those types, such that you do not need to send a query to
the SQL server just for adding an interval to a time stamp for
example.
The following subsections describe these special data types. For more details about the pgtypes library functions, see Section 36.6.
Here is a pattern for handling timestamp
variables
in the ECPG host application.
First, the program has to include the header file for the
timestamp
type:
#include <pgtypes_timestamp.h>
Next, declare a host variable as type timestamp
in
the declare section:
EXEC SQL BEGIN DECLARE SECTION; timestamp ts; EXEC SQL END DECLARE SECTION;
And after reading a value into the host variable, process it
using pgtypes library functions. In the following example, the
timestamp
value is converted into text (ASCII) form
with the PGTYPEStimestamp_to_asc()
function:
EXEC SQL SELECT now()::timestamp INTO :ts; printf("ts = %s\n", PGTYPEStimestamp_to_asc(ts));
This example will show a result like the following:
ts = 2010-06-27 18:03:56.949343
In addition, the DATE type can be handled in the same way. The
program has to include pgtypes_date.h
, declare a host variable
as the date type, and convert a DATE value into text form using the
PGTYPESdate_to_asc() function. For more details about the
pgtypes library functions, see Section 36.6.
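A minimal sketch of that date pattern, paralleling the timestamp example above (the connection target testdb is a placeholder, as in the document's other examples):

#include <stdio.h>
#include <pgtypes_date.h>

int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
    date d;
EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO testdb;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;

    /* Read a date value and print its text form. */
    EXEC SQL SELECT current_date INTO :d;
    printf("date = %s\n", PGTYPESdate_to_asc(d));

    EXEC SQL COMMIT;
    EXEC SQL DISCONNECT ALL;
    return 0;
}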
The handling of the interval
type is also similar
to the timestamp
and date
types. It
is required, however, to allocate memory for
an interval
type value explicitly. In other words,
the memory space for the variable has to be allocated in the
heap memory, not in the stack memory.
Here is an example program:
#include <stdio.h> #include <stdlib.h> #include <pgtypes_interval.h> int main(void) { EXEC SQL BEGIN DECLARE SECTION; interval *in; EXEC SQL END DECLARE SECTION; EXEC SQL CONNECT TO testdb; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; in = PGTYPESinterval_new(); EXEC SQL SELECT '1 min'::interval INTO :in; printf("interval = %s\n", PGTYPESinterval_to_asc(in)); PGTYPESinterval_free(in); EXEC SQL COMMIT; EXEC SQL DISCONNECT ALL; return 0; }
The handling of the numeric
and decimal
types is similar to the
interval
type: It requires defining a pointer,
allocating some memory space on the heap, and accessing the
variable using the pgtypes library functions. For more details
about the pgtypes library functions,
see Section 36.6.
No functions are provided specifically for
the decimal
type. An application has to convert it
to a numeric
variable using a pgtypes library
function to do further processing.
Here is an example program handling numeric
and decimal
type variables.
#include <stdio.h> #include <stdlib.h> #include <pgtypes_numeric.h> EXEC SQL WHENEVER SQLERROR STOP; int main(void) { EXEC SQL BEGIN DECLARE SECTION; numeric *num; numeric *num2; decimal *dec; EXEC SQL END DECLARE SECTION; EXEC SQL CONNECT TO testdb; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; num = PGTYPESnumeric_new(); dec = PGTYPESdecimal_new(); EXEC SQL SELECT 12.345::numeric(4,2), 23.456::decimal(4,2) INTO :num, :dec; printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 0)); printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 1)); printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 2)); /* Convert decimal to numeric to show a decimal value. */ num2 = PGTYPESnumeric_new(); PGTYPESnumeric_from_decimal(dec, num2); printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 0)); printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 1)); printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 2)); PGTYPESnumeric_free(num2); PGTYPESdecimal_free(dec); PGTYPESnumeric_free(num); EXEC SQL COMMIT; EXEC SQL DISCONNECT ALL; return 0; }
The handling of the bytea
type is similar to
that of VARCHAR
. The definition of an array of type
bytea
is converted into a named struct for every
variable. A declaration like:
bytea var[180];
is converted into:
struct bytea_var { int len; char arr[180]; } var;
The member arr
hosts binary format
data. It can also handle '\0'
as part of
data, unlike VARCHAR
.
The data is converted from/to hex format and sent/received by
ecpglib.
A bytea
variable can be used only when
bytea_output is set to hex
.
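A minimal sketch (the table blobs and its columns are placeholders; as noted above, the server's bytea_output must be hex):

EXEC SQL BEGIN DECLARE SECTION;
bytea img[4096];
EXEC SQL END DECLARE SECTION;
...
EXEC SQL SELECT data INTO :img FROM blobs WHERE id = 1;
/* img.len holds the number of bytes received; img.arr holds the raw
   binary data, which may contain embedded '\0' bytes */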
As a host variable you can also use arrays, typedefs, structs, and pointers.
There are two use cases for arrays as host variables. The first
is a way to store some text string in char[]
or VARCHAR[]
, as
explained in Section 36.4.4.1. The second use case is to
retrieve multiple rows from a query result without using a
cursor. Without an array, to process a query result consisting
of multiple rows, it is required to use a cursor and
the FETCH
command. But with array host
variables, multiple rows can be received at once. The length of
the array has to be defined to be able to accommodate all rows,
otherwise a buffer overflow will likely occur.
The following example scans the pg_database
system table and shows all OIDs and names of the available
databases:
int main(void) { EXEC SQL BEGIN DECLARE SECTION; int dbid[8]; char dbname[8][16]; int i; EXEC SQL END DECLARE SECTION; memset(dbname, 0, sizeof(char)* 16 * 8); memset(dbid, 0, sizeof(int) * 8); EXEC SQL CONNECT TO testdb; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; /* Retrieve multiple rows into arrays at once. */ EXEC SQL SELECT oid,datname INTO :dbid, :dbname FROM pg_database; for (i = 0; i < 8; i++) printf("oid=%d, dbname=%s\n", dbid[i], dbname[i]); EXEC SQL COMMIT; EXEC SQL DISCONNECT ALL; return 0; }
This example shows the following result. (The exact values depend on local circumstances.)
oid=1, dbname=template1 oid=11510, dbname=template0 oid=11511, dbname=postgres oid=313780, dbname=testdb oid=0, dbname= oid=0, dbname= oid=0, dbname=
A structure whose member names match the column names of a query result, can be used to retrieve multiple columns at once. The structure enables handling multiple column values in a single host variable.
The following example retrieves OIDs, names, and sizes of the
available databases from the pg_database
system table and using
the pg_database_size()
function. In this
example, a structure variable dbinfo_t
with
members whose names match each column in
the SELECT
result is used to retrieve one
result row without putting multiple host variables in
the FETCH
statement.
EXEC SQL BEGIN DECLARE SECTION; typedef struct { int oid; char datname[65]; long long int size; } dbinfo_t; dbinfo_t dbval; EXEC SQL END DECLARE SECTION; memset(&dbval, 0, sizeof(dbinfo_t)); EXEC SQL DECLARE cur1 CURSOR FOR SELECT oid, datname, pg_database_size(oid) AS size FROM pg_database; EXEC SQL OPEN cur1; /* when end of result set reached, break out of while loop */ EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { /* Fetch multiple columns into one structure. */ EXEC SQL FETCH FROM cur1 INTO :dbval; /* Print members of the structure. */ printf("oid=%d, datname=%s, size=%lld\n", dbval.oid, dbval.datname, dbval.size); } EXEC SQL CLOSE cur1;
This example shows the following result. (The exact values depend on local circumstances.)
oid=1, datname=template1, size=4324580 oid=11510, datname=template0, size=4243460 oid=11511, datname=postgres, size=4324580 oid=313780, datname=testdb, size=8183012
Structure host variables “absorb” as many columns
as the structure has fields. Additional columns can be assigned
to other host variables. For example, the above program could
also be restructured like this, with the size
variable outside the structure:
EXEC SQL BEGIN DECLARE SECTION; typedef struct { int oid; char datname[65]; } dbinfo_t; dbinfo_t dbval; long long int size; EXEC SQL END DECLARE SECTION; memset(&dbval, 0, sizeof(dbinfo_t)); EXEC SQL DECLARE cur1 CURSOR FOR SELECT oid, datname, pg_database_size(oid) AS size FROM pg_database; EXEC SQL OPEN cur1; /* when end of result set reached, break out of while loop */ EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { /* Fetch multiple columns into one structure. */ EXEC SQL FETCH FROM cur1 INTO :dbval, :size; /* Print members of the structure. */ printf("oid=%d, datname=%s, size=%lld\n", dbval.oid, dbval.datname, size); } EXEC SQL CLOSE cur1;
Use the typedef
keyword to map new types to already
existing types.
EXEC SQL BEGIN DECLARE SECTION; typedef char mychartype[40]; typedef long serial_t; EXEC SQL END DECLARE SECTION;
Note that you could also use:
EXEC SQL TYPE serial_t IS long;
This declaration does not need to be part of a declare section.
You can declare pointers to the most common types. Note however that you cannot use pointers as target variables of queries without auto-allocation. See Section 36.7 for more information on auto-allocation.
EXEC SQL BEGIN DECLARE SECTION; int *intp; char **charp; EXEC SQL END DECLARE SECTION;
This section contains information on how to handle nonscalar and user-defined SQL-level data types in ECPG applications. Note that this is distinct from the handling of host variables of nonprimitive types, described in the previous section.
Multi-dimensional SQL-level arrays are not directly supported in ECPG. One-dimensional SQL-level arrays can be mapped into C array host variables and vice-versa. However, when creating a statement ecpg does not know the types of the columns, so that it cannot check if a C array is input into a corresponding SQL-level array. When processing the output of an SQL statement, ecpg has the necessary information and thus checks if both are arrays.
If a query accesses elements of an array
separately, then this avoids the use of arrays in ECPG. Then, a
host variable with a type that can be mapped to the element type
should be used. For example, if a column type is array of
integer
, a host variable of type int
can be used. Also if the element type is varchar
or text
, a host variable of type char[]
or VARCHAR[]
can be used.
Here is an example. Assume the following table:
CREATE TABLE t3 ( ii integer[] ); testdb=> SELECT * FROM t3; ii ------------- {1,2,3,4,5} (1 row)
The following example program retrieves the 4th element of the
array and stores it into a host variable of
type int
:
EXEC SQL BEGIN DECLARE SECTION; int ii; EXEC SQL END DECLARE SECTION; EXEC SQL DECLARE cur1 CURSOR FOR SELECT ii[4] FROM t3; EXEC SQL OPEN cur1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { EXEC SQL FETCH FROM cur1 INTO :ii ; printf("ii=%d\n", ii); } EXEC SQL CLOSE cur1;
This example shows the following result:
ii=4
To map multiple array elements to multiple elements of an array host variable, each element of the array column and each element of the host variable array have to be managed separately, for example:
EXEC SQL BEGIN DECLARE SECTION; int ii_a[8]; EXEC SQL END DECLARE SECTION; EXEC SQL DECLARE cur1 CURSOR FOR SELECT ii[1], ii[2], ii[3], ii[4] FROM t3; EXEC SQL OPEN cur1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { EXEC SQL FETCH FROM cur1 INTO :ii_a[0], :ii_a[1], :ii_a[2], :ii_a[3]; ... }
Note again that
EXEC SQL BEGIN DECLARE SECTION; int ii_a[8]; EXEC SQL END DECLARE SECTION; EXEC SQL DECLARE cur1 CURSOR FOR SELECT ii FROM t3; EXEC SQL OPEN cur1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { /* WRONG */ EXEC SQL FETCH FROM cur1 INTO :ii_a; ... }
would not work correctly in this case, because you cannot map an array type column to an array host variable directly.
Another workaround is to store arrays in their external string
representation in host variables of type char[]
or VARCHAR[]
. For more details about this
representation, see Section 8.15.2. Note that
this means that the array cannot be accessed naturally as an
array in the host program (without further processing that parses
the text representation).
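For example, the whole array column of the table t3 above could be fetched in its text form (a minimal sketch):

EXEC SQL BEGIN DECLARE SECTION;
VARCHAR ii_text[64];
EXEC SQL END DECLARE SECTION;
...
EXEC SQL SELECT ii INTO :ii_text FROM t3;
/* ii_text.arr now holds the array's external representation, "{1,2,3,4,5}" */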
Composite types are not directly supported in ECPG, but an easy workaround is possible. The available workarounds are similar to the ones described for arrays above: Either access each attribute separately or use the external string representation.
For the following examples, assume the following type and table:
CREATE TYPE comp_t AS (intval integer, textval varchar(32)); CREATE TABLE t4 (compval comp_t); INSERT INTO t4 VALUES ( (256, 'PostgreSQL') );
The most obvious solution is to access each attribute separately.
The following program retrieves data from the example table by
selecting each attribute of the type comp_t
separately:
EXEC SQL BEGIN DECLARE SECTION; int intval; varchar textval[33]; EXEC SQL END DECLARE SECTION; /* Put each element of the composite type column in the SELECT list. */ EXEC SQL DECLARE cur1 CURSOR FOR SELECT (compval).intval, (compval).textval FROM t4; EXEC SQL OPEN cur1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { /* Fetch each element of the composite type column into host variables. */ EXEC SQL FETCH FROM cur1 INTO :intval, :textval; printf("intval=%d, textval=%s\n", intval, textval.arr); } EXEC SQL CLOSE cur1;
To enhance this example, the host variables to store values in
the FETCH
command can be gathered into one
structure. For more details about the host variable in the
structure form, see Section 36.4.4.3.2.
To switch to the structure, the example can be modified as below.
The two host variables, intval
and textval
, become members of
the comp_t
structure, and the structure
is specified on the FETCH
command.
EXEC SQL BEGIN DECLARE SECTION; typedef struct { int intval; varchar textval[33]; } comp_t; comp_t compval; EXEC SQL END DECLARE SECTION; /* Put each element of the composite type column in the SELECT list. */ EXEC SQL DECLARE cur1 CURSOR FOR SELECT (compval).intval, (compval).textval FROM t4; EXEC SQL OPEN cur1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { /* Put all values in the SELECT list into one structure. */ EXEC SQL FETCH FROM cur1 INTO :compval; printf("intval=%d, textval=%s\n", compval.intval, compval.textval.arr); } EXEC SQL CLOSE cur1;
Although a structure is used in the FETCH
command, the attribute names in the SELECT
clause are specified one by one. This can be enhanced by using
a *
to ask for all attributes of the composite
type value.
... EXEC SQL DECLARE cur1 CURSOR FOR SELECT (compval).* FROM t4; EXEC SQL OPEN cur1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { /* Put all values in the SELECT list into one structure. */ EXEC SQL FETCH FROM cur1 INTO :compval; printf("intval=%d, textval=%s\n", compval.intval, compval.textval.arr); } ...
This way, composite types can be mapped into structures almost seamlessly, even though ECPG does not understand the composite type itself.
Finally, it is also possible to store composite type values in
their external string representation in host variables of
type char[]
or VARCHAR[]
. But that
way, it is not easily possible to access the fields of the value
from the host program.
New user-defined base types are not directly supported by ECPG.
You can use the external string representation and host variables
of type char[]
or VARCHAR[]
, and this
solution is indeed appropriate and sufficient for many types.
Here is an example using the data type complex
from
the example in Section 38.13. The external string
representation of that type is (%f,%f)
,
which is defined in the
functions complex_in()
and complex_out()
in Section 38.13. The following example inserts the
complex type values (1,1)
and (3,3)
into the
columns a
and b
, and then selects
them from the table.
EXEC SQL BEGIN DECLARE SECTION; varchar a[64]; varchar b[64]; EXEC SQL END DECLARE SECTION; EXEC SQL INSERT INTO test_complex VALUES ('(1,1)', '(3,3)'); EXEC SQL DECLARE cur1 CURSOR FOR SELECT a, b FROM test_complex; EXEC SQL OPEN cur1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { EXEC SQL FETCH FROM cur1 INTO :a, :b; printf("a=%s, b=%s\n", a.arr, b.arr); } EXEC SQL CLOSE cur1;
This example shows following result:
a=(1,1), b=(3,3)
Another workaround is avoiding the direct use of the user-defined types in ECPG and instead create a function or cast that converts between the user-defined type and a primitive type that ECPG can handle. Note, however, that type casts, especially implicit ones, should be introduced into the type system very carefully.
For example,
CREATE FUNCTION create_complex(r double precision, i double precision) RETURNS complex LANGUAGE SQL IMMUTABLE AS $$ SELECT $1 * complex '(1,0)' + $2 * complex '(0,1)' $$;
After this definition, the following
EXEC SQL BEGIN DECLARE SECTION; double a, b, c, d; EXEC SQL END DECLARE SECTION; a = 1; b = 2; c = 3; d = 4; EXEC SQL INSERT INTO test_complex VALUES (create_complex(:a, :b), create_complex(:c, :d));
has the same effect as
EXEC SQL INSERT INTO test_complex VALUES ('(1,2)', '(3,4)');
The examples above do not handle null values. In fact, the retrieval examples will raise an error if they fetch a null value from the database. To be able to pass null values to the database or retrieve null values from the database, you need to append a second host variable specification to each host variable that contains data. This second host variable is called the indicator and contains a flag that tells whether the datum is null, in which case the value of the real host variable is ignored. Here is an example that handles the retrieval of null values correctly:
EXEC SQL BEGIN DECLARE SECTION;
VARCHAR val;
int val_ind;
EXEC SQL END DECLARE SECTION;
...
EXEC SQL SELECT b INTO :val :val_ind FROM test1;
The indicator variable val_ind
will be zero if
the value was not null, and it will be negative if the value was
null. (See Section 36.16 to enable
Oracle-specific behavior.)
The indicator has another function: if the indicator value is positive, it means that the value is not null, but it was truncated when it was stored in the host variable.
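Indicator variables are also used for input: setting the indicator to a negative value before executing a statement passes a null regardless of the host variable's contents. A minimal sketch, reusing the test1 table from above:

EXEC SQL BEGIN DECLARE SECTION;
VARCHAR val;
int val_ind;
EXEC SQL END DECLARE SECTION;
...
val_ind = -1;    /* send a null; the contents of val are ignored */
EXEC SQL INSERT INTO test1 (b) VALUES (:val :val_ind);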
If the argument -r no_indicator
is passed to
the preprocessor ecpg
, it works in
“no-indicator” mode. In no-indicator mode, if no
indicator variable is specified, null values are signaled (on
input and output) for character string types as empty string and
for integer types as the lowest possible value for type (for
example, INT_MIN
for int
).
In many cases, the particular SQL statements that an application has to execute are known at the time the application is written. In some cases, however, the SQL statements are composed at run time or provided by an external source. In these cases you cannot embed the SQL statements directly into the C source code, but there is a facility that allows you to call arbitrary SQL statements that you provide in a string variable.
The simplest way to execute an arbitrary SQL statement is to use
the command EXECUTE IMMEDIATE
. For example:
EXEC SQL BEGIN DECLARE SECTION; const char *stmt = "CREATE TABLE test1 (...);"; EXEC SQL END DECLARE SECTION; EXEC SQL EXECUTE IMMEDIATE :stmt;
EXECUTE IMMEDIATE
can be used for SQL
statements that do not return a result set (e.g.,
DDL, INSERT
, UPDATE
,
DELETE
). You cannot execute statements that
retrieve data (e.g., SELECT
) this way. The
next section describes how to do that.
A more powerful way to execute arbitrary SQL statements is to prepare them once and execute the prepared statement as often as you like. It is also possible to prepare a generalized version of a statement and then execute specific versions of it by substituting parameters. When preparing the statement, write question marks where you want to substitute parameters later. For example:
EXEC SQL BEGIN DECLARE SECTION; const char *stmt = "INSERT INTO test1 VALUES(?, ?);"; EXEC SQL END DECLARE SECTION; EXEC SQL PREPARE mystmt FROM :stmt; ... EXEC SQL EXECUTE mystmt USING 42, 'foobar';
When you don't need the prepared statement anymore, you should deallocate it:
EXEC SQL DEALLOCATE PREPARE name
;
To execute an SQL statement with a single result row,
EXECUTE
can be used. To save the result, add
an INTO
clause.
EXEC SQL BEGIN DECLARE SECTION; const char *stmt = "SELECT a, b, c FROM test1 WHERE a > ?"; int v1, v2; VARCHAR v3[50]; EXEC SQL END DECLARE SECTION; EXEC SQL PREPARE mystmt FROM :stmt; ... EXEC SQL EXECUTE mystmt INTO :v1, :v2, :v3 USING 37;
An EXECUTE
command can have an
INTO
clause, a USING
clause,
both, or neither.
If a query is expected to return more than one result row, a cursor should be used, as in the following example. (See Section 36.3.2 for more details about the cursor.)
EXEC SQL BEGIN DECLARE SECTION; char dbaname[128]; char datname[128]; char *stmt = "SELECT u.usename as dbaname, d.datname " " FROM pg_database d, pg_user u " " WHERE d.datdba = u.usesysid"; EXEC SQL END DECLARE SECTION; EXEC SQL CONNECT TO testdb AS con1 USER testuser; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; EXEC SQL PREPARE stmt1 FROM :stmt; EXEC SQL DECLARE cursor1 CURSOR FOR stmt1; EXEC SQL OPEN cursor1; EXEC SQL WHENEVER NOT FOUND DO BREAK; while (1) { EXEC SQL FETCH cursor1 INTO :dbaname,:datname; printf("dbaname=%s, datname=%s\n", dbaname, datname); } EXEC SQL CLOSE cursor1; EXEC SQL COMMIT; EXEC SQL DISCONNECT ALL;
The pgtypes library maps PostgreSQL database types to C equivalents that can be used in C programs. It also offers functions to do basic calculations with those types within C, i.e., without the help of the PostgreSQL server. See the following example:
EXEC SQL BEGIN DECLARE SECTION; date date1; timestamp ts1, tsout; interval iv1; char *out; EXEC SQL END DECLARE SECTION; PGTYPESdate_today(&date1); EXEC SQL SELECT started, duration INTO :ts1, :iv1 FROM datetbl WHERE d=:date1; PGTYPEStimestamp_add_interval(&ts1, &iv1, &tsout); out = PGTYPEStimestamp_to_asc(&tsout); printf("Started + duration: %s\n", out); PGTYPESchar_free(out);
Some functions such as PGTYPESnumeric_to_asc
return
a pointer to a freshly allocated character string. These results should be
freed with PGTYPESchar_free
instead of
free
. (This is important only on Windows, where
memory allocation and release sometimes need to be done by the same
library.)
The numeric type allows calculations with arbitrary precision. See
Section 8.1 for the equivalent type in the
PostgreSQL server. Because of the arbitrary precision this
variable needs to be able to expand and shrink dynamically. That's why you
can only create numeric variables on the heap, by means of the
PGTYPESnumeric_new
and PGTYPESnumeric_free
functions. The decimal type, which is similar but limited in precision,
can be created on the stack as well as on the heap.
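As an illustration, here is a sketch that parses two numeric values from strings, adds them, and prints the result. It assumes a reasonably recent libpgtypes that provides PGTYPESchar_free (declared in pgtypes.h); error handling is abbreviated:

#include <stdio.h>
#include <pgtypes.h>          /* for PGTYPESchar_free */
#include <pgtypes_numeric.h>

int
main(void)
{
    char     a_str[] = "592.49E07";
    char     b_str[] = "-32.84e-4";
    numeric *a = PGTYPESnumeric_from_asc(a_str, NULL);
    numeric *b = PGTYPESnumeric_from_asc(b_str, NULL);
    numeric *sum = PGTYPESnumeric_new();
    char    *text;

    if (a == NULL || b == NULL || PGTYPESnumeric_add(a, b, sum) < 0)
    {
        fprintf(stderr, "numeric arithmetic failed\n");
        return 1;
    }

    text = PGTYPESnumeric_to_asc(sum, 10);   /* print with 10 decimal digits */
    printf("sum = %s\n", text);
    PGTYPESchar_free(text);

    PGTYPESnumeric_free(a);
    PGTYPESnumeric_free(b);
    PGTYPESnumeric_free(sum);
    return 0;
}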
The following functions can be used to work with the numeric type:
PGTYPESnumeric_new
Request a pointer to a newly allocated numeric variable.
numeric *PGTYPESnumeric_new(void);
PGTYPESnumeric_free
Free a numeric type, release all of its memory.
void PGTYPESnumeric_free(numeric *var);
PGTYPESnumeric_from_asc
Parse a numeric type from its string notation.
numeric *PGTYPESnumeric_from_asc(char *str, char **endptr);
Valid formats are for example:
-2
,
.794
,
+3.44
,
592.49E07
or
-32.84e-4
.
If the value could be parsed successfully, a valid pointer is returned,
else the NULL pointer. At the moment ECPG always parses the complete
string and so it currently does not support storing the address of the
first invalid character in *endptr
. You can safely
set endptr
to NULL.
PGTYPESnumeric_to_asc
Returns a pointer to a string allocated by malloc
that contains the string
representation of the numeric type num
.
char *PGTYPESnumeric_to_asc(numeric *num, int dscale);
The numeric value will be printed with dscale
decimal
digits, with rounding applied if necessary.
The result must be freed with PGTYPESchar_free()
.
PGTYPESnumeric_add
Add two numeric variables into a third one.
int PGTYPESnumeric_add(numeric *var1, numeric *var2, numeric *result);
The function adds the variables var1
and
var2
into the result variable
result
.
The function returns 0 on success and -1 in case of error.
PGTYPESnumeric_sub
Subtract two numeric variables and return the result in a third one.
int PGTYPESnumeric_sub(numeric *var1, numeric *var2, numeric *result);
The function subtracts the variable var2
from
the variable var1
. The result of the operation is
stored in the variable result
.
The function returns 0 on success and -1 in case of error.
PGTYPESnumeric_mul
Multiply two numeric variables and return the result in a third one.
int PGTYPESnumeric_mul(numeric *var1, numeric *var2, numeric *result);
The function multiplies the variables var1
and
var2
. The result of the operation is stored in the
variable result
.
The function returns 0 on success and -1 in case of error.
PGTYPESnumeric_div
Divide two numeric variables and return the result in a third one.
int PGTYPESnumeric_div(numeric *var1, numeric *var2, numeric *result);
The function divides the variables var1
by
var2
. The result of the operation is stored in the
variable result
.
The function returns 0 on success and -1 in case of error.
PGTYPESnumeric_cmp
Compare two numeric variables.
int PGTYPESnumeric_cmp(numeric *var1, numeric *var2)
This function compares two numeric variables. In case of error,
INT_MAX
is returned. On success, the function
returns one of three possible results:
1, if var1
is bigger than var2
-1, if var1
is smaller than var2
0, if var1
and var2
are equal
PGTYPESnumeric_from_int
Convert an int variable to a numeric variable.
int PGTYPESnumeric_from_int(signed int int_val, numeric *var);
This function accepts a variable of type signed int and stores it
in the numeric variable var
. Upon success, 0 is returned and
-1 in case of a failure.
PGTYPESnumeric_from_long
Convert a long int variable to a numeric variable.
int PGTYPESnumeric_from_long(signed long int long_val, numeric *var);
This function accepts a variable of type signed long int and stores it
in the numeric variable var
. Upon success, 0 is returned and
-1 in case of a failure.
PGTYPESnumeric_copy
Copy over one numeric variable into another one.
int PGTYPESnumeric_copy(numeric *src, numeric *dst);
This function copies over the value of the variable that
src
points to into the variable that dst
points to. It returns 0 on success and -1 if an error occurs.
PGTYPESnumeric_from_double
Convert a variable of type double to a numeric.
int PGTYPESnumeric_from_double(double d, numeric *dst);
This function accepts a variable of type double and stores the result
in the variable that dst
points to. It returns 0 on success
and -1 if an error occurs.
PGTYPESnumeric_to_double
Convert a variable of type numeric to double.
int PGTYPESnumeric_to_double(numeric *nv, double *dp)
The function converts the numeric value from the variable that
nv
points to into the double variable that dp
points
to. It returns 0 on success and -1 if an error occurs, including
overflow. On overflow, the global variable errno
will be set
to PGTYPES_NUM_OVERFLOW
additionally.
PGTYPESnumeric_to_int
Convert a variable of type numeric to int.
int PGTYPESnumeric_to_int(numeric *nv, int *ip);
The function converts the numeric value from the variable that
nv
points to into the integer variable that ip
points to. It returns 0 on success and -1 if an error occurs, including
overflow. On overflow, the global variable errno
will be set
to PGTYPES_NUM_OVERFLOW
additionally.
PGTYPESnumeric_to_long
Convert a variable of type numeric to long.
int PGTYPESnumeric_to_long(numeric *nv, long *lp);
The function converts the numeric value from the variable that
nv
points to into the long integer variable that
lp
points to. It returns 0 on success and -1 if an error
occurs, including overflow. On overflow, the global variable
errno
will be set to PGTYPES_NUM_OVERFLOW
additionally.
PGTYPESnumeric_to_decimal
Convert a variable of type numeric to decimal.
int PGTYPESnumeric_to_decimal(numeric *src, decimal *dst);
The function converts the numeric value from the variable that
src
points to into the decimal variable that
dst
points to. It returns 0 on success and -1 if an error
occurs, including overflow. On overflow, the global variable
errno
will be set to PGTYPES_NUM_OVERFLOW
additionally.
PGTYPESnumeric_from_decimal
Convert a variable of type decimal to numeric.
int PGTYPESnumeric_from_decimal(decimal *src, numeric *dst);
The function converts the decimal value from the variable that
src
points to into the numeric variable that
dst
points to. It returns 0 on success and -1 if an error
occurs. Since the decimal type is implemented as a limited version of
the numeric type, overflow cannot occur with this conversion.
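Putting several of these functions together, here is a minimal sketch of pure client-side numeric arithmetic. It assumes the pgtypes headers shipped with ECPG (pgtypes_numeric.h and, for PGTYPESchar_free, pgtypes.h in recent releases) and that the program is linked against libpgtypes; error handling is reduced to the bare minimum.
#include <stdio.h>
#include <pgtypes.h>            /* PGTYPESchar_free (recent releases) */
#include <pgtypes_numeric.h>

int
main(void)
{
    numeric *a, *b, *sum;
    char    *text;

    /* Parse two values from their string notation. */
    a = PGTYPESnumeric_from_asc("592.49", NULL);
    b = PGTYPESnumeric_from_asc("-32.84", NULL);

    /* Numeric variables can only be created on the heap. */
    sum = PGTYPESnumeric_new();
    if (a == NULL || b == NULL || sum == NULL)
        return 1;

    if (PGTYPESnumeric_add(a, b, sum) < 0)
        return 1;

    /* Print the result rounded to two decimal digits. */
    text = PGTYPESnumeric_to_asc(sum, 2);
    printf("sum = %s\n", text);
    PGTYPESchar_free(text);

    PGTYPESnumeric_free(a);
    PGTYPESnumeric_free(b);
    PGTYPESnumeric_free(sum);
    return 0;
}
Because no embedded SQL is used, such a file can be compiled with a plain C compiler; only the pgtypes library is needed.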
The date type in C enables your programs to deal with data of the SQL type date. See Section 8.5 for the equivalent type in the PostgreSQL server.
The following functions can be used to work with the date type:
PGTYPESdate_from_timestamp
Extract the date part from a timestamp.
date PGTYPESdate_from_timestamp(timestamp dt);
The function receives a timestamp as its only argument and returns the extracted date part from this timestamp.
PGTYPESdate_from_asc
Parse a date from its textual representation.
date PGTYPESdate_from_asc(char *str, char **endptr);
The function receives a C char* string str
and a pointer to
a C char* string endptr
. At the moment ECPG always parses
the complete string and so it currently does not support to store the
address of the first invalid character in *endptr
.
You can safely set endptr
to NULL.
Note that the function always assumes MDY-formatted dates and there is currently no variable to change that within ECPG.
Table 36.2 shows the allowed input formats.
Table 36.2. Valid Input Formats for PGTYPESdate_from_asc
Input | Result |
---|---|
January 8, 1999 | January 8, 1999 |
1999-01-08 | January 8, 1999 |
1/8/1999 | January 8, 1999 |
1/18/1999 | January 18, 1999 |
01/02/03 | February 1, 2003 |
1999-Jan-08 | January 8, 1999 |
Jan-08-1999 | January 8, 1999 |
08-Jan-1999 | January 8, 1999 |
99-Jan-08 | January 8, 1999 |
08-Jan-99 | January 8, 1999 |
08-Jan-06 | January 8, 2006 |
Jan-08-99 | January 8, 1999 |
19990108 | ISO 8601; January 8, 1999 |
990108 | ISO 8601; January 8, 1999 |
1999.008 | year and day of year |
J2451187 | Julian day |
January 8, 99 BC | year 99 before the Common Era |
PGTYPESdate_to_asc
Return the textual representation of a date variable.
char *PGTYPESdate_to_asc(date dDate);
The function receives the date dDate
as its only parameter.
It will output the date in the form 1999-01-18
, i.e., in the
YYYY-MM-DD
format.
The result must be freed with PGTYPESchar_free()
.
PGTYPESdate_julmdy
Extract the values for the day, the month and the year from a variable of type date.
void PGTYPESdate_julmdy(date d, int *mdy);
The function receives the date d
and a pointer to an array
of 3 integer values mdy
. The variable name indicates
the sequential order: mdy[0]
will be set to contain the
number of the month, mdy[1]
will be set to the value of the
day and mdy[2]
will contain the year.
PGTYPESdate_mdyjul
Create a date value from an array of 3 integers that specify the day, the month and the year of the date.
void PGTYPESdate_mdyjul(int *mdy, date *jdate);
The function receives the array of the 3 integers (mdy
) as
its first argument and as its second argument a pointer to a variable
of type date that should hold the result of the operation.
PGTYPESdate_dayofweek
Return a number representing the day of the week for a date value.
int PGTYPESdate_dayofweek(date d);
The function receives the date variable d
as its only
argument and returns an integer that indicates the day of the week for
this date.
0 - Sunday
1 - Monday
2 - Tuesday
3 - Wednesday
4 - Thursday
5 - Friday
6 - Saturday
PGTYPESdate_today
Get the current date.
void PGTYPESdate_today(date *d);
The function receives a pointer to a date variable (d
)
that it sets to the current date.
PGTYPESdate_fmt_asc
Convert a variable of type date to its textual representation using a format mask.
int PGTYPESdate_fmt_asc(date dDate, char *fmtstring, char *outbuf);
The function receives the date to convert (dDate
), the
format mask (fmtstring
) and the string that will hold the
textual representation of the date (outbuf
).
On success, 0 is returned; a negative value is returned if an error occurred.
The following literals are the field specifiers you can use:
dd
- The number of the day of the month.
mm
- The number of the month of the year.
yy
- The number of the year as a two digit number.
yyyy
- The number of the year as a four digit number.
ddd
- The name of the day (abbreviated).
mmm
- The name of the month (abbreviated).
All other characters are copied 1:1 to the output string.
Table 36.3 indicates a few possible formats. This will give you an idea of how to use this function. All output lines are based on the same date: November 23, 1959.
Table 36.3. Valid Input Formats for PGTYPESdate_fmt_asc
Format | Result |
---|---|
mmddyy | 112359 |
ddmmyy | 231159 |
yymmdd | 591123 |
yy/mm/dd | 59/11/23 |
yy mm dd | 59 11 23 |
yy.mm.dd | 59.11.23 |
.mm.yyyy.dd. | .11.1959.23. |
mmm. dd, yyyy | Nov. 23, 1959 |
mmm dd yyyy | Nov 23 1959 |
yyyy dd mm | 1959 23 11 |
ddd, mmm. dd, yyyy | Mon, Nov. 23, 1959 |
(ddd) mmm. dd, yyyy | (Mon) Nov. 23, 1959 |
PGTYPESdate_defmt_asc
Use a format mask to convert a C char*
string to a value of type
date.
int PGTYPESdate_defmt_asc(date *d, char *fmt, char *str);
The function receives a pointer to the date value that should hold the
result of the operation (d
), the format mask to use for
parsing the date (fmt
) and the C char* string containing
the textual representation of the date (str
). The textual
representation is expected to match the format mask. However you do not
need to have a 1:1 mapping of the string to the format mask. The
function only analyzes the sequential order and looks for the literals
yy
or yyyy
that indicate the
position of the year, mm
to indicate the position of
the month and dd
to indicate the position of the
day.
Table 36.4 indicates a few possible formats. This will give you an idea of how to use this function.
Table 36.4. Valid Input Formats for rdefmtdate
Format | String | Result |
---|---|---|
ddmmyy | 21-2-54 | 1954-02-21 |
ddmmyy | 2-12-54 | 1954-12-02 |
ddmmyy | 20111954 | 1954-11-20 |
ddmmyy | 130464 | 1964-04-13 |
mmm.dd.yyyy | MAR-12-1967 | 1967-03-12 |
yy/mm/dd | 1954, February 3rd | 1954-02-03 |
mmm.dd.yyyy | 041269 | 1969-04-12 |
yy/mm/dd | In the year 2525, in the month of July, mankind will be alive on the 28th day | 2525-07-28 |
dd-mm-yy | I said on the 28th of July in the year 2525 | 2525-07-28 |
mmm.dd.yyyy | 9/14/58 | 1958-09-14 |
yy/mm/dd | 47/03/29 | 1947-03-29 |
mmm.dd.yyyy | oct 28 1975 | 1975-10-28 |
mmddyy | Nov 14th, 1985 | 1985-11-14 |
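To make the interplay of these date functions concrete, the following hedged sketch obtains the current date, splits it into month, day and year with PGTYPESdate_julmdy, and prints it both in the default form and through a format mask. The header names (pgtypes_date.h, and pgtypes.h for PGTYPESchar_free in recent releases) and the buffer size are assumptions of this sketch.
#include <stdio.h>
#include <pgtypes.h>            /* PGTYPESchar_free (recent releases) */
#include <pgtypes_date.h>

int
main(void)
{
    date  today;
    int   mdy[3];
    char *text;
    char  outbuf[64];

    PGTYPESdate_today(&today);

    /* Default textual form, YYYY-MM-DD; must be freed with PGTYPESchar_free. */
    text = PGTYPESdate_to_asc(today);
    printf("today = %s\n", text);
    PGTYPESchar_free(text);

    /* mdy[0] = month, mdy[1] = day, mdy[2] = year. */
    PGTYPESdate_julmdy(today, mdy);
    printf("month=%d day=%d year=%d\n", mdy[0], mdy[1], mdy[2]);

    /* Format with a mask; outbuf must be large enough for the result. */
    if (PGTYPESdate_fmt_asc(today, "ddd, mmm. dd, yyyy", outbuf) == 0)
        printf("formatted = %s\n", outbuf);

    return 0;
}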
The timestamp type in C enables your programs to deal with data of the SQL type timestamp. See Section 8.5 for the equivalent type in the PostgreSQL server.
The following functions can be used to work with the timestamp type:
PGTYPEStimestamp_from_asc
Parse a timestamp from its textual representation into a timestamp variable.
timestamp PGTYPEStimestamp_from_asc(char *str, char **endptr);
The function receives the string to parse (str
) and a
pointer to a C char* (endptr
).
At the moment ECPG always parses
the complete string and so it currently does not support storing the
address of the first invalid character in *endptr
.
You can safely set endptr
to NULL.
The function returns the parsed timestamp on success. On error,
PGTYPESInvalidTimestamp
is returned and errno
is
set to PGTYPES_TS_BAD_TIMESTAMP
. See PGTYPESInvalidTimestamp
for important notes on this value.
In general, the input string can contain any combination of an allowed date specification, a whitespace character and an allowed time specification. Note that time zones are not supported by ECPG: it can parse them, but it does not apply any calculation with them as, for example, the PostgreSQL server does. Timezone specifiers are silently discarded.
Table 36.5 contains a few examples for input strings.
Table 36.5. Valid Input Formats for PGTYPEStimestamp_from_asc
Input | Result |
---|---|
1999-01-08 04:05:06 | 1999-01-08 04:05:06 |
January 8 04:05:06 1999 PST | 1999-01-08 04:05:06 |
1999-Jan-08 04:05:06.789-8 | 1999-01-08 04:05:06.789 (time zone specifier ignored) |
J2451187 04:05-08:00 | 1999-01-08 04:05:00 (time zone specifier ignored) |
PGTYPEStimestamp_to_asc
Converts a date to a C char* string.
char *PGTYPEStimestamp_to_asc(timestamp tstamp);
The function receives the timestamp tstamp
as
its only argument and returns an allocated string that contains the
textual representation of the timestamp.
The result must be freed with PGTYPESchar_free()
.
PGTYPEStimestamp_current
Retrieve the current timestamp.
void PGTYPEStimestamp_current(timestamp *ts);
The function retrieves the current timestamp and saves it into the
timestamp variable that ts
points to.
PGTYPEStimestamp_fmt_asc
Convert a timestamp variable to a C char* using a format mask.
int PGTYPEStimestamp_fmt_asc(timestamp *ts, char *output, int str_len, char *fmtstr);
The function receives a pointer to the timestamp to convert as its
first argument (ts
), a pointer to the output buffer
(output
), the maximal length that has been allocated for
the output buffer (str_len
) and the format mask to
use for the conversion (fmtstr
).
Upon success, the function returns 0; it returns a negative value if an error occurred.
You can use the following format specifiers for the format mask. The
format specifiers are the same ones that are used in the
strftime
function in libc. Any
non-format specifier will be copied into the output buffer.
%A
- is replaced by national representation of
the full weekday name.
%a
- is replaced by national representation of
the abbreviated weekday name.
%B
- is replaced by national representation of
the full month name.
%b
- is replaced by national representation of
the abbreviated month name.
%C
- is replaced by (year / 100) as decimal
number; single digits are preceded by a zero.
%c
- is replaced by national representation of
time and date.
%D
- is equivalent to
%m/%d/%y
.
%d
- is replaced by the day of the month as a
decimal number (01–31).
%E*
%O*
- POSIX locale
extensions. The sequences
%Ec
%EC
%Ex
%EX
%Ey
%EY
%Od
%Oe
%OH
%OI
%Om
%OM
%OS
%Ou
%OU
%OV
%Ow
%OW
%Oy
are supposed to provide alternative representations.
Additionally, %OB is implemented to represent alternative month names (used standalone, without the day mentioned).
%e
- is replaced by the day of month as a decimal
number (1–31); single digits are preceded by a blank.
%F
- is equivalent to %Y-%m-%d
.
%G
- is replaced by a year as a decimal number
with century. This year is the one that contains the greater part of
the week (Monday as the first day of the week).
%g
- is replaced by the same year as in
%G
, but as a decimal number without century
(00–99).
%H
- is replaced by the hour (24-hour clock) as a
decimal number (00–23).
%h
- the same as %b
.
%I
- is replaced by the hour (12-hour clock) as a
decimal number (01–12).
%j
- is replaced by the day of the year as a
decimal number (001–366).
%k
- is replaced by the hour (24-hour clock) as a
decimal number (0–23); single digits are preceded by a blank.
%l
- is replaced by the hour (12-hour clock) as a
decimal number (1–12); single digits are preceded by a blank.
%M
- is replaced by the minute as a decimal
number (00–59).
%m
- is replaced by the month as a decimal number
(01–12).
%n
- is replaced by a newline.
%O*
- the same as %E*
.
%p
- is replaced by national representation of
either “ante meridiem” or “post meridiem” as appropriate.
%R
- is equivalent to %H:%M
.
%r
- is equivalent to %I:%M:%S
%p
.
%S
- is replaced by the second as a decimal
number (00–60).
%s
- is replaced by the number of seconds since
the Epoch, UTC.
%T
- is equivalent to %H:%M:%S
%t
- is replaced by a tab.
%U
- is replaced by the week number of the year
(Sunday as the first day of the week) as a decimal number (00–53).
%u
- is replaced by the weekday (Monday as the
first day of the week) as a decimal number (1–7).
%V
- is replaced by the week number of the year
(Monday as the first day of the week) as a decimal number (01–53).
If the week containing January 1 has four or more days in the new
year, then it is week 1; otherwise it is the last week of the
previous year, and the next week is week 1.
%v
- is equivalent to
%e-%b-%Y
.
%W
- is replaced by the week number of the year
(Monday as the first day of the week) as a decimal number (00–53).
%w
- is replaced by the weekday (Sunday as the
first day of the week) as a decimal number (0–6).
%X
- is replaced by national representation of
the time.
%x
- is replaced by national representation of
the date.
%Y
- is replaced by the year with century as a
decimal number.
%y
- is replaced by the year without century as a
decimal number (00–99).
%Z
- is replaced by the time zone name.
%z
- is replaced by the time zone offset from
UTC; a leading plus sign stands for east of UTC, a minus sign for
west of UTC, hours and minutes follow with two digits each and no
delimiter between them (common form for RFC 822 date headers).
%+
- is replaced by national representation of
the date and time.
%-*
- GNU libc extension. Do not do any padding
when performing numerical outputs.
%_* - GNU libc extension. Explicitly specify space for padding.
%0*
- GNU libc extension. Explicitly specify zero
for padding.
%%
- is replaced by %
.
PGTYPEStimestamp_sub
Subtract one timestamp from another one and save the result in a variable of type interval.
int PGTYPEStimestamp_sub(timestamp *ts1, timestamp *ts2, interval *iv);
The function will subtract the timestamp variable that ts2
points to from the timestamp variable that ts1
points to
and will store the result in the interval variable that iv
points to.
Upon success, the function returns 0; it returns a negative value if an error occurred.
PGTYPEStimestamp_defmt_asc
Parse a timestamp value from its textual representation using a formatting mask.
int PGTYPEStimestamp_defmt_asc(char *str, char *fmt, timestamp *d);
The function receives the textual representation of a timestamp in the
variable str
as well as the formatting mask to use in the
variable fmt
. The result will be stored in the variable
that d
points to.
If the formatting mask fmt
is NULL, the function will fall
back to the default formatting mask which is %Y-%m-%d
%H:%M:%S
.
This is the reverse function to PGTYPEStimestamp_fmt_asc
. See the documentation there in
order to find out about the possible formatting mask entries.
PGTYPEStimestamp_add_interval
Add an interval variable to a timestamp variable.
int PGTYPEStimestamp_add_interval(timestamp *tin, interval *span, timestamp *tout);
The function receives a pointer to a timestamp variable tin
and a pointer to an interval variable span
. It adds the
interval to the timestamp and saves the resulting timestamp in the
variable that tout
points to.
Upon success, the function returns 0; it returns a negative value if an error occurred.
PGTYPEStimestamp_sub_interval
Subtract an interval variable from a timestamp variable.
int PGTYPEStimestamp_sub_interval(timestamp *tin, interval *span, timestamp *tout);
The function subtracts the interval variable that span
points to from the timestamp variable that tin
points to
and saves the result into the variable that tout
points
to.
Upon success, the function returns 0; it returns a negative value if an error occurred.
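The following is a small, hedged sketch combining a few of the timestamp functions above: it formats the current timestamp with a strftime-style mask and computes the interval between it and a parsed timestamp. The header names (pgtypes_timestamp.h, pgtypes_interval.h, pgtypes.h) and buffer sizes are assumptions of this sketch.
#include <stdio.h>
#include <pgtypes.h>            /* PGTYPESchar_free (recent releases) */
#include <pgtypes_timestamp.h>
#include <pgtypes_interval.h>

int
main(void)
{
    timestamp now, earlier;
    interval *diff;
    char      buf[64];
    char     *text;

    PGTYPEStimestamp_current(&now);
    earlier = PGTYPEStimestamp_from_asc("1999-01-08 04:05:06", NULL);

    /* strftime-like formatting into a caller-provided buffer. */
    if (PGTYPEStimestamp_fmt_asc(&now, buf, sizeof(buf), "%Y-%m-%d %H:%M:%S") == 0)
        printf("now = %s\n", buf);

    /* The difference of two timestamps is an interval. */
    diff = PGTYPESinterval_new();
    if (diff != NULL)
    {
        if (PGTYPEStimestamp_sub(&now, &earlier, diff) == 0)
        {
            text = PGTYPESinterval_to_asc(diff);
            printf("elapsed = %s\n", text);
            PGTYPESchar_free(text);
        }
        PGTYPESinterval_free(diff);
    }
    return 0;
}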
The interval type in C enables your programs to deal with data of the SQL type interval. See Section 8.5 for the equivalent type in the PostgreSQL server.
The following functions can be used to work with the interval type:
PGTYPESinterval_new
Return a pointer to a newly allocated interval variable.
interval *PGTYPESinterval_new(void);
PGTYPESinterval_free
Release the memory of a previously allocated interval variable.
void PGTYPESinterval_free(interval *intvl);
PGTYPESinterval_from_asc
Parse an interval from its textual representation.
interval *PGTYPESinterval_from_asc(char *str, char **endptr);
The function parses the input string str
and returns a
pointer to an allocated interval variable.
At the moment ECPG always parses
the complete string and so it currently does not support storing the
address of the first invalid character in *endptr
.
You can safely set endptr
to NULL.
PGTYPESinterval_to_asc
Convert a variable of type interval to its textual representation.
char *PGTYPESinterval_to_asc(interval *span);
The function converts the interval variable that span
points to into a C char*. The output looks like this example:
@ 1 day 12 hours 59 mins 10 secs
.
The result must be freed with PGTYPESchar_free()
.
PGTYPESinterval_copy
Copy a variable of type interval.
int PGTYPESinterval_copy(interval *intvlsrc, interval *intvldest);
The function copies the interval variable that intvlsrc
points to into the variable that intvldest
points to. Note
that you need to allocate the memory for the destination variable
before.
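A minimal sketch exercising these interval functions: it parses an interval from text, copies it into a second, separately allocated variable, and prints it. The input string, header names and error handling are illustrative only.
#include <stdio.h>
#include <pgtypes.h>            /* PGTYPESchar_free (recent releases) */
#include <pgtypes_interval.h>

int
main(void)
{
    interval *span;
    interval *copy;
    char     *text;

    /* Parse an interval from its textual representation. */
    span = PGTYPESinterval_from_asc("1 day 12 hours 59 minutes 10 seconds", NULL);
    if (span == NULL)
        return 1;

    /* The destination of PGTYPESinterval_copy must be allocated beforehand. */
    copy = PGTYPESinterval_new();
    if (copy == NULL)
        return 1;
    PGTYPESinterval_copy(span, copy);

    text = PGTYPESinterval_to_asc(copy);
    printf("interval = %s\n", text);    /* e.g., @ 1 day 12 hours 59 mins 10 secs */
    PGTYPESchar_free(text);

    PGTYPESinterval_free(span);
    PGTYPESinterval_free(copy);
    return 0;
}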
The decimal type is similar to the numeric type. However it is limited to
a maximum precision of 30 significant digits. In contrast to the numeric
type which can be created on the heap only, the decimal type can be
created either on the stack or on the heap (by means of the functions
PGTYPESdecimal_new
and
PGTYPESdecimal_free
).
There are a lot of other functions that deal with the decimal type in the
Informix compatibility mode described in Section 36.15.
The following functions can be used to work with the decimal type and are
not only contained in the libcompat
library.
PGTYPESdecimal_new
Request a pointer to a newly allocated decimal variable.
decimal *PGTYPESdecimal_new(void);
PGTYPESdecimal_free
Free a decimal type, release all of its memory.
void PGTYPESdecimal_free(decimal *var);
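As a hedged sketch of how the decimal type interacts with the numeric functions above, the following converts a numeric value into a stack-allocated decimal and back again. Header names and the chosen dscale are assumptions of this sketch.
#include <stdio.h>
#include <pgtypes.h>            /* PGTYPESchar_free (recent releases) */
#include <pgtypes_numeric.h>

int
main(void)
{
    numeric *num;
    numeric *back;
    decimal  dec;               /* decimal variables may live on the stack */
    char    *text;

    num = PGTYPESnumeric_from_asc("3.14159", NULL);
    back = PGTYPESnumeric_new();
    if (num == NULL || back == NULL)
        return 1;

    /* numeric -> decimal can overflow; decimal -> numeric cannot. */
    if (PGTYPESnumeric_to_decimal(num, &dec) == 0 &&
        PGTYPESnumeric_from_decimal(&dec, back) == 0)
    {
        text = PGTYPESnumeric_to_asc(back, 5);
        printf("round trip = %s\n", text);
        PGTYPESchar_free(text);
    }

    PGTYPESnumeric_free(num);
    PGTYPESnumeric_free(back);
    return 0;
}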
PGTYPES_NUM_BAD_NUMERIC
An argument should contain a numeric variable (or point to a numeric variable) but in fact its in-memory representation was invalid.
PGTYPES_NUM_OVERFLOW
An overflow occurred. Since the numeric type can deal with almost arbitrary precision, converting a numeric variable into other types might cause overflow.
PGTYPES_NUM_UNDERFLOW
An underflow occurred. Since the numeric type can deal with almost arbitrary precision, converting a numeric variable into other types might cause underflow.
PGTYPES_NUM_DIVIDE_ZERO
A division by zero has been attempted.
PGTYPES_DATE_BAD_DATE
An invalid date string was passed to
the PGTYPESdate_from_asc
function.
PGTYPES_DATE_ERR_EARGS
Invalid arguments were passed to the
PGTYPESdate_defmt_asc
function.
PGTYPES_DATE_ERR_ENOSHORTDATE
An invalid token in the input string was found by the
PGTYPESdate_defmt_asc
function.
PGTYPES_INTVL_BAD_INTERVAL
An invalid interval string was passed to the
PGTYPESinterval_from_asc
function, or an
invalid interval value was passed to the
PGTYPESinterval_to_asc
function.
PGTYPES_DATE_ERR_ENOTDMY
There was a mismatch in the day/month/year assignment in the
PGTYPESdate_defmt_asc
function.
PGTYPES_DATE_BAD_DAY
An invalid day of the month value was found by
the PGTYPESdate_defmt_asc
function.
PGTYPES_DATE_BAD_MONTH
An invalid month value was found by
the PGTYPESdate_defmt_asc
function.
PGTYPES_TS_BAD_TIMESTAMP
An invalid timestamp string was passed to
the PGTYPEStimestamp_from_asc
function,
or an invalid timestamp value was passed to
the PGTYPEStimestamp_to_asc
function.
PGTYPES_TS_ERR_EINFTIME
An infinite timestamp value was encountered in a context that cannot handle it.
PGTYPESInvalidTimestamp
A value of type timestamp representing an invalid time stamp. This is
returned by the function PGTYPEStimestamp_from_asc
on
parse error.
Note that due to the internal representation of the timestamp
data type,
PGTYPESInvalidTimestamp
is also a valid timestamp at
the same time. It is set to 1899-12-31 23:59:59
. In order
to detect errors, make sure that your application does not only test
for PGTYPESInvalidTimestamp
but also for
errno != 0
after each call to
PGTYPEStimestamp_from_asc
.
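A minimal sketch of the recommended check, assuming the input string is arbitrary user input; only errno is examined here, since a failed parse sets it to PGTYPES_TS_BAD_TIMESTAMP while a successful parse leaves it at 0. Header names are assumptions of this sketch.
#include <stdio.h>
#include <errno.h>
#include <pgtypes.h>            /* PGTYPESchar_free (recent releases) */
#include <pgtypes_timestamp.h>

int
main(void)
{
    timestamp ts;
    char     *text;

    errno = 0;
    ts = PGTYPEStimestamp_from_asc("1899-12-31 23:59:59", NULL);

    /* This input is a legitimate timestamp that happens to equal
     * PGTYPESInvalidTimestamp, so errno is the reliable indicator. */
    if (errno != 0)
    {
        fprintf(stderr, "parse error (errno=%d)\n", errno);
        return 1;
    }

    text = PGTYPEStimestamp_to_asc(ts);
    printf("parsed = %s\n", text);
    PGTYPESchar_free(text);
    return 0;
}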
An SQL descriptor area is a more sophisticated method for processing
the result of a SELECT
, FETCH
or
a DESCRIBE
statement. An SQL descriptor area groups
the data of one row of data together with metadata items into one
data structure. The metadata is particularly useful when executing
dynamic SQL statements, where the nature of the result columns might
not be known ahead of time. PostgreSQL provides two ways to use
Descriptor Areas: the named SQL Descriptor Areas and the C-structure
SQLDAs.
A named SQL descriptor area consists of a header, which contains information concerning the entire descriptor, and one or more item descriptor areas, which basically each describe one column in the result row.
Before you can use an SQL descriptor area, you need to allocate one:
EXEC SQL ALLOCATE DESCRIPTOR identifier;
The identifier serves as the “variable name” of the descriptor area. When you don't need the descriptor anymore, you should deallocate it:
EXEC SQL DEALLOCATE DESCRIPTOR identifier;
To use a descriptor area, specify it as the storage target in an
INTO
clause, instead of listing host variables:
EXEC SQL FETCH NEXT FROM mycursor INTO SQL DESCRIPTOR mydesc;
If the result set is empty, the Descriptor Area will still contain the metadata from the query, i.e., the field names.
For not yet executed prepared queries, the DESCRIBE
statement can be used to get the metadata of the result set:
EXEC SQL BEGIN DECLARE SECTION; char *sql_stmt = "SELECT * FROM table1"; EXEC SQL END DECLARE SECTION; EXEC SQL PREPARE stmt1 FROM :sql_stmt; EXEC SQL DESCRIBE stmt1 INTO SQL DESCRIPTOR mydesc;
Before PostgreSQL 9.0, the SQL
keyword was optional,
so using DESCRIPTOR
and SQL DESCRIPTOR
produced named SQL Descriptor Areas. Now it is mandatory; omitting
the SQL
keyword produces SQLDA Descriptor Areas
(see Section 36.7.2).
In DESCRIBE
and FETCH
statements,
the INTO
and USING
keywords can be
used similarly: they produce the result set and the metadata in a
Descriptor Area.
Now how do you get the data out of the descriptor area? You can think of the descriptor area as a structure with named fields. To retrieve the value of a field from the header and store it into a host variable, use the following command:
EXEC SQL GET DESCRIPTOR name :hostvar = field;
Currently, there is only one header field defined:
COUNT
, which tells how many item
descriptor areas exist (that is, how many columns are contained in
the result). The host variable needs to be of an integer type. To
get a field from the item descriptor area, use the following
command:
EXEC SQL GET DESCRIPTOR name VALUE num :hostvar = field;
num
can be a literal integer or a host
variable containing an integer. Possible fields are:
CARDINALITY
(integer) number of rows in the result set
DATA
actual data item (therefore, the data type of this field depends on the query)
DATETIME_INTERVAL_CODE
(integer)
When TYPE
is 9
,
DATETIME_INTERVAL_CODE
will have a value of
1
for DATE
,
2
for TIME
,
3
for TIMESTAMP
,
4
for TIME WITH TIME ZONE
, or
5
for TIMESTAMP WITH TIME ZONE
.
DATETIME_INTERVAL_PRECISION
(integer) not implemented
INDICATOR
(integer) the indicator (indicating a null value or a value truncation)
KEY_MEMBER
(integer) not implemented
LENGTH
(integer) length of the datum in characters
NAME
(string) name of the column
NULLABLE
(integer) not implemented
OCTET_LENGTH
(integer) length of the character representation of the datum in bytes
PRECISION
(integer) precision (for type numeric)
RETURNED_LENGTH
(integer) length of the datum in characters
RETURNED_OCTET_LENGTH
(integer) length of the character representation of the datum in bytes
SCALE
(integer) scale (for type numeric)
TYPE
(integer) numeric code of the data type of the column
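As a hedged illustration of reading these fields back, the following ECPG fragment fetches one row from a cursor (here called mycur, assumed to be already declared and opened over some query) into a named descriptor, then walks over the columns. For simplicity every value is retrieved as a string and NULL handling via INDICATOR is omitted; real code would choose host variable types based on TYPE.
EXEC SQL BEGIN DECLARE SECTION;
int  colcount;
int  i;
char colname[64];
char colval[256];
EXEC SQL END DECLARE SECTION;

EXEC SQL ALLOCATE DESCRIPTOR mydesc;
EXEC SQL FETCH NEXT FROM mycur INTO SQL DESCRIPTOR mydesc;

/* Header field: how many columns does the row have? */
EXEC SQL GET DESCRIPTOR mydesc :colcount = COUNT;

for (i = 1; i <= colcount; i++)
{
    /* Item fields: column name and its value, both as strings here. */
    EXEC SQL GET DESCRIPTOR mydesc VALUE :i :colname = NAME;
    EXEC SQL GET DESCRIPTOR mydesc VALUE :i :colval = DATA;
    printf("%s = %s\n", colname, colval);
}

EXEC SQL DEALLOCATE DESCRIPTOR mydesc;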
In EXECUTE
, DECLARE
and OPEN
statements, the effect of the INTO
and USING
keywords are different. A Descriptor Area can also be manually built to
provide the input parameters for a query or a cursor and
USING SQL DESCRIPTOR
is the way to pass the input parameters into a parameterized query. The statement
to build a named SQL Descriptor Area is below:
EXEC SQL SET DESCRIPTOR name VALUE num field = :hostvar;
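Here is a hedged sketch of that usage: a descriptor (indesc) is filled with one input parameter and passed to a cursor over a prepared statement. The statement text, the names stmt2, cur2 and indesc, and the choice of parameter are illustrative only.
EXEC SQL BEGIN DECLARE SECTION;
char *stmt_text = "SELECT datname FROM pg_database WHERE oid = ?";
int   param = 1;
char  dbname[128];
EXEC SQL END DECLARE SECTION;

EXEC SQL PREPARE stmt2 FROM :stmt_text;
EXEC SQL ALLOCATE DESCRIPTOR indesc;

/* Describe one input parameter and set its value. */
EXEC SQL SET DESCRIPTOR indesc COUNT = 1;
EXEC SQL SET DESCRIPTOR indesc VALUE 1 DATA = :param;

/* The descriptor supplies the input parameters of the query. */
EXEC SQL DECLARE cur2 CURSOR FOR stmt2;
EXEC SQL OPEN cur2 USING SQL DESCRIPTOR indesc;

EXEC SQL FETCH NEXT FROM cur2 INTO :dbname;
printf("datname = %s\n", dbname);

EXEC SQL CLOSE cur2;
EXEC SQL DEALLOCATE DESCRIPTOR indesc;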
PostgreSQL supports retrieving more than one record in one FETCH statement; storing the data in host variables in this case assumes that the variable is an array. For example:
EXEC SQL BEGIN DECLARE SECTION; int id[5]; EXEC SQL END DECLARE SECTION; EXEC SQL FETCH 5 FROM mycursor INTO SQL DESCRIPTOR mydesc; EXEC SQL GET DESCRIPTOR mydesc VALUE 1 :id = DATA;
An SQLDA Descriptor Area is a C language structure which can be also used to get the result set and the metadata of a query. One structure stores one record from the result set.
EXEC SQL include sqlda.h; sqlda_t *mysqlda; EXEC SQL FETCH 3 FROM mycursor INTO DESCRIPTOR mysqlda;
Note that the SQL
keyword is omitted. The paragraphs about
the use cases of the INTO
and USING
keywords in Section 36.7.1 also apply here with an addition.
In a DESCRIBE
statement the DESCRIPTOR
keyword can be completely omitted if the INTO
keyword is used:
EXEC SQL DESCRIBE prepared_statement INTO mysqlda;
The general flow of a program that uses SQLDA is:
Prepare a query, and declare a cursor for it.
Declare an SQLDA for the result rows.
Declare an SQLDA for the input parameters, and initialize them (memory allocation, parameter settings).
Open a cursor with the input SQLDA.
Fetch rows from the cursor, and store them into an output SQLDA.
Read values from the output SQLDA into the host variables (with conversion if necessary).
Close the cursor.
Free the memory area allocated for the input SQLDA.
SQLDA uses three data structure
types: sqlda_t
, sqlvar_t
,
and struct sqlname
.
PostgreSQL's SQLDA has a data structure similar to the one in IBM DB2 Universal Database, so some technical information on DB2's SQLDA could help in understanding PostgreSQL's implementation better.
The structure type sqlda_t
is the type of the
actual SQLDA. It holds one record. And two or
more sqlda_t
structures can be connected in a
linked list with the pointer in
the desc_next
field, thus
representing an ordered collection of rows. So, when two or
more rows are fetched, the application can read them by
following the desc_next
pointer in
each sqlda_t
node.
The definition of sqlda_t
is:
struct sqlda_struct { char sqldaid[8]; long sqldabc; short sqln; short sqld; struct sqlda_struct *desc_next; struct sqlvar_struct sqlvar[1]; }; typedef struct sqlda_struct sqlda_t;
The meaning of the fields is:
sqldaid
It contains the literal string "SQLDA "
.
sqldabc
It contains the size of the allocated space in bytes.
sqln
It contains the number of input parameters for a parameterized query in
case it's passed into OPEN
, DECLARE
or
EXECUTE
statements using the USING
keyword. In case it's used as output of SELECT
,
EXECUTE
or FETCH
statements,
its value is the same as that of sqld.
sqld
It contains the number of fields in a result set.
desc_next
If the query returns more than one record, multiple linked
SQLDA structures are returned, and desc_next
holds
a pointer to the next entry in the list.
sqlvar
This is the array of the columns in the result set.
The structure type sqlvar_t
holds a column value
and metadata such as type and length. The definition of the type
is:
struct sqlvar_struct { short sqltype; short sqllen; char *sqldata; short *sqlind; struct sqlname sqlname; }; typedef struct sqlvar_struct sqlvar_t;
The meaning of the fields is:
sqltype
Contains the type identifier of the field. For values,
see enum ECPGttype
in ecpgtype.h
.
sqllen
Contains the binary length of the field, e.g., 4 bytes for ECPGt_int
.
sqldata
Points to the data. The format of the data is described in Section 36.4.4.
sqlind
Points to the null indicator. 0 means not null, -1 means null.
sqlname
The name of the field.
A struct sqlname
structure holds a column name. It
is used as a member of the sqlvar_t
structure. The
definition of the structure is:
#define NAMEDATALEN 64 struct sqlname { short length; char data[NAMEDATALEN]; };
The meaning of the fields is:
length
Contains the length of the field name.
data
Contains the actual field name.
The general steps to retrieve a query result set through an SQLDA are:
Declare an sqlda_t
structure to receive the result set.
Execute FETCH
/EXECUTE
/DESCRIBE
commands to process a query specifying the declared SQLDA.
Check the number of records in the result set by looking at sqln
, a member of the sqlda_t
structure.
Get the values of each column from sqlvar[0]
, sqlvar[1]
, etc., members of the sqlda_t
structure.
Go to next row (sqlda_t
structure) by following the desc_next
pointer, a member of the sqlda_t
structure.
Repeat the above steps as needed.
Here is an example retrieving a result set through an SQLDA.
First, declare a sqlda_t
structure to receive the result set.
sqlda_t *sqlda1;
Next, specify the SQLDA in a command. This is
a FETCH
command example.
EXEC SQL FETCH NEXT FROM cur1 INTO DESCRIPTOR sqlda1;
Run a loop following the linked list to retrieve the rows.
sqlda_t *cur_sqlda; for (cur_sqlda = sqlda1; cur_sqlda != NULL; cur_sqlda = cur_sqlda->desc_next) { ... }
Inside the loop, run another loop to retrieve each column data
(sqlvar_t
structure) of the row.
for (i = 0; i < cur_sqlda->sqld; i++) { sqlvar_t v = cur_sqlda->sqlvar[i]; char *sqldata = v.sqldata; short sqllen = v.sqllen; ... }
To get a column value, check the sqltype
value,
a member of the sqlvar_t
structure. Then, switch
to an appropriate way, depending on the column type, to copy
data from the sqlvar
field to a host variable.
char var_buf[1024]; switch (v.sqltype) { case ECPGt_char: memset(&var_buf, 0, sizeof(var_buf)); memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf) - 1 : sqllen)); break; case ECPGt_int: /* integer */ memcpy(&intval, sqldata, sqllen); snprintf(var_buf, sizeof(var_buf), "%d", intval); break; ... }
The general steps to use an SQLDA to pass input parameters to a prepared query are:
Create a prepared query (prepared statement).
Declare an sqlda_t structure as an input SQLDA.
Allocate memory area (as sqlda_t structure) for the input SQLDA.
Set (copy) input values in the allocated memory.
Open a cursor, specifying the input SQLDA.
Here is an example.
First, create a prepared statement.
EXEC SQL BEGIN DECLARE SECTION; char query[1024] = "SELECT d.oid, * FROM pg_database d, pg_stat_database s WHERE d.oid = s.datid AND (d.datname = ? OR d.oid = ?)"; EXEC SQL END DECLARE SECTION; EXEC SQL PREPARE stmt1 FROM :query;
Next, allocate memory for an SQLDA, and set the number of input
parameters in sqln
, a member variable of
the sqlda_t
structure. When two or more input
parameters are required for the prepared query, the application
has to allocate additional memory space which is calculated by
(nr. of params - 1) * sizeof(sqlvar_t). The example shown here
allocates memory space for two input parameters.
sqlda_t *sqlda2; sqlda2 = (sqlda_t *) malloc(sizeof(sqlda_t) + sizeof(sqlvar_t)); memset(sqlda2, 0, sizeof(sqlda_t) + sizeof(sqlvar_t)); sqlda2->sqln = 2; /* number of input variables */
After memory allocation, store the parameter values into the
sqlvar[]
array. (This is the same array used for
retrieving column values when the SQLDA is receiving a result
set.) In this example, the input parameters
are "postgres"
, having a string type,
and 1
, having an integer type.
sqlda2->sqlvar[0].sqltype = ECPGt_char; sqlda2->sqlvar[0].sqldata = "postgres"; sqlda2->sqlvar[0].sqllen = 8; int intval = 1; sqlda2->sqlvar[1].sqltype = ECPGt_int; sqlda2->sqlvar[1].sqldata = (char *) &intval; sqlda2->sqlvar[1].sqllen = sizeof(intval);
By opening a cursor and specifying the SQLDA that was set up beforehand, the input parameters are passed to the prepared statement.
EXEC SQL OPEN cur1 USING DESCRIPTOR sqlda2;
Finally, after using input SQLDAs, the allocated memory space must be freed explicitly, unlike SQLDAs used for receiving query results.
free(sqlda2);
Here is an example program, which describes how to fetch access statistics of the databases, specified by the input parameters, from the system catalogs.
This application joins two system tables, pg_database and
pg_stat_database on the database OID, and also fetches and shows
the database statistics which are retrieved by two input
parameters (a database postgres
, and OID 1
).
First, declare an SQLDA for input and an SQLDA for output.
EXEC SQL include sqlda.h; sqlda_t *sqlda1; /* an output descriptor */ sqlda_t *sqlda2; /* an input descriptor */
Next, connect to the database, prepare a statement, and declare a cursor for the prepared statement.
int main(void) { EXEC SQL BEGIN DECLARE SECTION; char query[1024] = "SELECT d.oid,* FROM pg_database d, pg_stat_database s WHERE d.oid=s.datid AND ( d.datname=? OR d.oid=? )"; EXEC SQL END DECLARE SECTION; EXEC SQL CONNECT TO testdb AS con1 USER testuser; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; EXEC SQL PREPARE stmt1 FROM :query; EXEC SQL DECLARE cur1 CURSOR FOR stmt1;
Next, put some values in the input SQLDA for the input
parameters. Allocate memory for the input SQLDA, and set the
number of input parameters to sqln
. Store
type, value, and value length into sqltype
,
sqldata
, and sqllen
in the
sqlvar
structure.
/* Create SQLDA structure for input parameters. */ sqlda2 = (sqlda_t *) malloc(sizeof(sqlda_t) + sizeof(sqlvar_t)); memset(sqlda2, 0, sizeof(sqlda_t) + sizeof(sqlvar_t)); sqlda2->sqln = 2; /* number of input variables */ sqlda2->sqlvar[0].sqltype = ECPGt_char; sqlda2->sqlvar[0].sqldata = "postgres"; sqlda2->sqlvar[0].sqllen = 8; intval = 1; sqlda2->sqlvar[1].sqltype = ECPGt_int; sqlda2->sqlvar[1].sqldata = (char *)&intval; sqlda2->sqlvar[1].sqllen = sizeof(intval);
After setting up the input SQLDA, open a cursor with the input SQLDA.
/* Open a cursor with input parameters. */ EXEC SQL OPEN cur1 USING DESCRIPTOR sqlda2;
Fetch rows into the output SQLDA from the opened cursor.
(Generally, you have to call FETCH
repeatedly
in the loop, to fetch all rows in the result set.)
while (1) { sqlda_t *cur_sqlda; /* Assign descriptor to the cursor */ EXEC SQL FETCH NEXT FROM cur1 INTO DESCRIPTOR sqlda1;
Next, retrieve the fetched records from the SQLDA, by following
the linked list of the sqlda_t
structure.
for (cur_sqlda = sqlda1 ; cur_sqlda != NULL ; cur_sqlda = cur_sqlda->desc_next) { ...
Read each column in the first record. The number of columns is
stored in sqld
, the actual data of the first
column is stored in sqlvar[0]
, both members of
the sqlda_t
structure.
/* Print every column in a row. */ for (i = 0; i < sqlda1->sqld; i++) { sqlvar_t v = sqlda1->sqlvar[i]; char *sqldata = v.sqldata; short sqllen = v.sqllen; strncpy(name_buf, v.sqlname.data, v.sqlname.length); name_buf[v.sqlname.length] = '\0';
Now, the column data is stored in the variable v
.
Copy every datum into host variables, looking
at v.sqltype
for the type of the column.
switch (v.sqltype) { int intval; double doubleval; unsigned long long int longlongval; case ECPGt_char: memset(&var_buf, 0, sizeof(var_buf)); memcpy(&var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf)-1 : sqllen)); break; case ECPGt_int: /* integer */ memcpy(&intval, sqldata, sqllen); snprintf(var_buf, sizeof(var_buf), "%d", intval); break; ... default: ... } printf("%s = %s (type: %d)\n", name_buf, var_buf, v.sqltype); }
Close the cursor after processing all of the records, and disconnect from the database.
EXEC SQL CLOSE cur1; EXEC SQL COMMIT; EXEC SQL DISCONNECT ALL;
The whole program is shown in Example 36.1.
Example 36.1. Example SQLDA Program
#include <stdlib.h> #include <string.h> #include <stdlib.h> #include <stdio.h> #include <unistd.h> EXEC SQL include sqlda.h; sqlda_t *sqlda1; /* descriptor for output */ sqlda_t *sqlda2; /* descriptor for input */ EXEC SQL WHENEVER NOT FOUND DO BREAK; EXEC SQL WHENEVER SQLERROR STOP; int main(void) { EXEC SQL BEGIN DECLARE SECTION; char query[1024] = "SELECT d.oid,* FROM pg_database d, pg_stat_database s WHERE d.oid=s.datid AND ( d.datname=? OR d.oid=? )"; int intval; unsigned long long int longlongval; EXEC SQL END DECLARE SECTION; EXEC SQL CONNECT TO uptimedb AS con1 USER uptime; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; EXEC SQL PREPARE stmt1 FROM :query; EXEC SQL DECLARE cur1 CURSOR FOR stmt1; /* Create an SQLDA structure for an input parameter */ sqlda2 = (sqlda_t *)malloc(sizeof(sqlda_t) + sizeof(sqlvar_t)); memset(sqlda2, 0, sizeof(sqlda_t) + sizeof(sqlvar_t)); sqlda2->sqln = 2; /* a number of input variables */ sqlda2->sqlvar[0].sqltype = ECPGt_char; sqlda2->sqlvar[0].sqldata = "postgres"; sqlda2->sqlvar[0].sqllen = 8; intval = 1; sqlda2->sqlvar[1].sqltype = ECPGt_int; sqlda2->sqlvar[1].sqldata = (char *) &intval; sqlda2->sqlvar[1].sqllen = sizeof(intval); /* Open a cursor with input parameters. */ EXEC SQL OPEN cur1 USING DESCRIPTOR sqlda2; while (1) { sqlda_t *cur_sqlda; /* Assign descriptor to the cursor */ EXEC SQL FETCH NEXT FROM cur1 INTO DESCRIPTOR sqlda1; for (cur_sqlda = sqlda1 ; cur_sqlda != NULL ; cur_sqlda = cur_sqlda->desc_next) { int i; char name_buf[1024]; char var_buf[1024]; /* Print every column in a row. */ for (i=0 ; i<cur_sqlda->sqld ; i++) { sqlvar_t v = cur_sqlda->sqlvar[i]; char *sqldata = v.sqldata; short sqllen = v.sqllen; strncpy(name_buf, v.sqlname.data, v.sqlname.length); name_buf[v.sqlname.length] = '\0'; switch (v.sqltype) { case ECPGt_char: memset(&var_buf, 0, sizeof(var_buf)); memcpy(&var_buf, sqldata, (sizeof(var_buf)<=sqllen ? sizeof(var_buf)-1 : sqllen) ); break; case ECPGt_int: /* integer */ memcpy(&intval, sqldata, sqllen); snprintf(var_buf, sizeof(var_buf), "%d", intval); break; case ECPGt_long_long: /* bigint */ memcpy(&longlongval, sqldata, sqllen); snprintf(var_buf, sizeof(var_buf), "%lld", longlongval); break; default: { int i; memset(var_buf, 0, sizeof(var_buf)); for (i = 0; i < sqllen; i++) { char tmpbuf[16]; snprintf(tmpbuf, sizeof(tmpbuf), "%02x ", (unsigned char) sqldata[i]); strncat(var_buf, tmpbuf, sizeof(var_buf)); } } break; } printf("%s = %s (type: %d)\n", name_buf, var_buf, v.sqltype); } printf("\n"); } } EXEC SQL CLOSE cur1; EXEC SQL COMMIT; EXEC SQL DISCONNECT ALL; return 0; }
The output of this example should look something like the following (some numbers will vary).
oid = 1 (type: 1) datname = template1 (type: 1) datdba = 10 (type: 1) encoding = 0 (type: 5) datistemplate = t (type: 1) datallowconn = t (type: 1) datconnlimit = -1 (type: 5) datlastsysoid = 11510 (type: 1) datfrozenxid = 379 (type: 1) dattablespace = 1663 (type: 1) datconfig = (type: 1) datacl = {=c/uptime,uptime=CTc/uptime} (type: 1) datid = 1 (type: 1) datname = template1 (type: 1) numbackends = 0 (type: 5) xact_commit = 113606 (type: 9) xact_rollback = 0 (type: 9) blks_read = 130 (type: 9) blks_hit = 7341714 (type: 9) tup_returned = 38262679 (type: 9) tup_fetched = 1836281 (type: 9) tup_inserted = 0 (type: 9) tup_updated = 0 (type: 9) tup_deleted = 0 (type: 9) oid = 11511 (type: 1) datname = postgres (type: 1) datdba = 10 (type: 1) encoding = 0 (type: 5) datistemplate = f (type: 1) datallowconn = t (type: 1) datconnlimit = -1 (type: 5) datlastsysoid = 11510 (type: 1) datfrozenxid = 379 (type: 1) dattablespace = 1663 (type: 1) datconfig = (type: 1) datacl = (type: 1) datid = 11511 (type: 1) datname = postgres (type: 1) numbackends = 0 (type: 5) xact_commit = 221069 (type: 9) xact_rollback = 18 (type: 9) blks_read = 1176 (type: 9) blks_hit = 13943750 (type: 9) tup_returned = 77410091 (type: 9) tup_fetched = 3253694 (type: 9) tup_inserted = 0 (type: 9) tup_updated = 0 (type: 9) tup_deleted = 0 (type: 9)
This section describes how you can handle exceptional conditions and warnings in an embedded SQL program. There are two nonexclusive facilities for this.
WHENEVER
command.
sqlca
variable.
One simple method to catch errors and warnings is to set a specific action to be executed whenever a particular condition occurs. In general:
EXEC SQL WHENEVER condition action;
condition
can be one of the following:
SQLERROR
The specified action is called whenever an error occurs during the execution of an SQL statement.
SQLWARNING
The specified action is called whenever a warning occurs during the execution of an SQL statement.
NOT FOUND
The specified action is called whenever an SQL statement retrieves or affects zero rows. (This condition is not an error, but you might be interested in handling it specially.)
action
can be one of the following:
CONTINUE
This effectively means that the condition is ignored. This is the default.
GOTO label
GO TO label
Jump to the specified label (using a C goto
statement).
SQLPRINT
Print a message to standard error. This is useful for simple programs or during prototyping. The details of the message cannot be configured.
STOP
Call exit(1)
, which will terminate the
program.
DO BREAK
Execute the C statement break
. This should
only be used in loops or switch
statements.
DO CONTINUE
Execute the C statement continue
. This should
only be used in loop statements. If executed, it will cause the flow
of control to return to the top of the loop.
CALL name
(args
)
DO name
(args
)
Call the specified C functions with the specified arguments. (This
use is different from the meaning of CALL
and DO
in the normal PostgreSQL grammar.)
The SQL standard only provides for the actions
CONTINUE
and GOTO
(and
GO TO
).
Here is an example that you might want to use in a simple program. It prints a simple message when a warning occurs and aborts the program when an error happens:
EXEC SQL WHENEVER SQLWARNING SQLPRINT; EXEC SQL WHENEVER SQLERROR STOP;
The statement EXEC SQL WHENEVER
is a directive
of the SQL preprocessor, not a C statement. The error or warning
actions that it sets apply to all embedded SQL statements that
appear below the point where the handler is set, unless a
different action was set for the same condition between the first
EXEC SQL WHENEVER
and the SQL statement causing
the condition, regardless of the flow of control in the C program.
So neither of the two following C program excerpts will have the
desired effect:
/* * WRONG */ int main(int argc, char *argv[]) { ... if (verbose) { EXEC SQL WHENEVER SQLWARNING SQLPRINT; } ... EXEC SQL SELECT ...; ... }
/* * WRONG */ int main(int argc, char *argv[]) { ... set_error_handler(); ... EXEC SQL SELECT ...; ... } static void set_error_handler(void) { EXEC SQL WHENEVER SQLERROR STOP; }
For more powerful error handling, the embedded SQL interface
provides a global variable with the name sqlca
(SQL communication area)
that has the following structure:
struct { char sqlcaid[8]; long sqlabc; long sqlcode; struct { int sqlerrml; char sqlerrmc[SQLERRMC_LEN]; } sqlerrm; char sqlerrp[8]; long sqlerrd[6]; char sqlwarn[8]; char sqlstate[5]; } sqlca;
(In a multithreaded program, every thread automatically gets its
own copy of sqlca
. This works similarly to the
handling of the standard C global variable
errno
.)
sqlca
covers both warnings and errors. If
multiple warnings or errors occur during the execution of a
statement, then sqlca
will only contain
information about the last one.
If no error occurred in the last SQL statement,
sqlca.sqlcode
will be 0 and
sqlca.sqlstate
will be
"00000"
. If a warning or error occurred, then
sqlca.sqlcode
will be negative and
sqlca.sqlstate
will be different from
"00000"
. A positive
sqlca.sqlcode
indicates a harmless condition,
such as that the last query returned zero rows.
sqlcode
and sqlstate
are two
different error code schemes; details appear below.
If the last SQL statement was successful, then
sqlca.sqlerrd[1]
contains the OID of the
processed row, if applicable, and
sqlca.sqlerrd[2]
contains the number of
processed or returned rows, if applicable to the command.
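For illustration, here is a hedged fragment that inspects these fields right after a statement; the table mytable and its column flag are hypothetical, and sqlca is available as described above.
EXEC SQL UPDATE mytable SET flag = true WHERE flag = false;

if (sqlca.sqlcode < 0)
    fprintf(stderr, "error %ld (SQLSTATE %.5s): %s\n",
            sqlca.sqlcode, sqlca.sqlstate, sqlca.sqlerrm.sqlerrmc);
else
    /* sqlerrd[2]: number of rows processed by the last command. */
    printf("%ld row(s) updated\n", sqlca.sqlerrd[2]);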
In case of an error or warning,
sqlca.sqlerrm.sqlerrmc
will contain a string
that describes the error. The field
sqlca.sqlerrm.sqlerrml
contains the length of
the error message that is stored in
sqlca.sqlerrm.sqlerrmc
(the result of
strlen()
, not really interesting for a C
programmer). Note that some messages are too long to fit in the
fixed-size sqlerrmc
array; they will be truncated.
In case of a warning, sqlca.sqlwarn[2]
is set
to W
. (In all other cases, it is set to
something different from W
.) If
sqlca.sqlwarn[1]
is set to
W
, then a value was truncated when it was
stored in a host variable. sqlca.sqlwarn[0]
is
set to W
if any of the other elements are set
to indicate a warning.
The fields sqlcaid
,
sqlabc
,
sqlerrp
, and the remaining elements of
sqlerrd
and
sqlwarn
currently contain no useful
information.
The structure sqlca
is not defined in the SQL
standard, but is implemented in several other SQL database
systems. The definitions are similar at the core, but if you want
to write portable applications, then you should investigate the
different implementations carefully.
Here is one example that combines the use of WHENEVER
and sqlca
, printing out the contents
of sqlca
when an error occurs. This is perhaps
useful for debugging or prototyping applications, before
installing a more “user-friendly” error handler.
EXEC SQL WHENEVER SQLERROR CALL print_sqlca(); void print_sqlca() { fprintf(stderr, "==== sqlca ====\n"); fprintf(stderr, "sqlcode: %ld\n", sqlca.sqlcode); fprintf(stderr, "sqlerrm.sqlerrml: %d\n", sqlca.sqlerrm.sqlerrml); fprintf(stderr, "sqlerrm.sqlerrmc: %s\n", sqlca.sqlerrm.sqlerrmc); fprintf(stderr, "sqlerrd: %ld %ld %ld %ld %ld %ld\n", sqlca.sqlerrd[0],sqlca.sqlerrd[1],sqlca.sqlerrd[2], sqlca.sqlerrd[3],sqlca.sqlerrd[4],sqlca.sqlerrd[5]); fprintf(stderr, "sqlwarn: %d %d %d %d %d %d %d %d\n", sqlca.sqlwarn[0], sqlca.sqlwarn[1], sqlca.sqlwarn[2], sqlca.sqlwarn[3], sqlca.sqlwarn[4], sqlca.sqlwarn[5], sqlca.sqlwarn[6], sqlca.sqlwarn[7]); fprintf(stderr, "sqlstate: %5s\n", sqlca.sqlstate); fprintf(stderr, "===============\n"); }
The result could look as follows (here an error due to a misspelled table name):
==== sqlca ==== sqlcode: -400 sqlerrm.sqlerrml: 49 sqlerrm.sqlerrmc: relation "pg_databasep" does not exist on line 38 sqlerrd: 0 0 0 0 0 0 sqlwarn: 0 0 0 0 0 0 0 0 sqlstate: 42P01 ===============
SQLSTATE
vs. SQLCODE
The fields sqlca.sqlstate
and
sqlca.sqlcode
are two different schemes that
provide error codes. Both are derived from the SQL standard, but
SQLCODE
has been marked deprecated in the SQL-92
edition of the standard and has been dropped in later editions.
Therefore, new applications are strongly encouraged to use
SQLSTATE
.
SQLSTATE
is a five-character array. The five
characters contain digits or upper-case letters that represent
codes of various error and warning conditions.
SQLSTATE
has a hierarchical scheme: the first
two characters indicate the general class of the condition, the
last three characters indicate a subclass of the general
condition. A successful state is indicated by the code
00000
. The SQLSTATE
codes are for
the most part defined in the SQL standard. The
PostgreSQL server natively supports
SQLSTATE
error codes; therefore a high degree
of consistency can be achieved by using this error code scheme
throughout all applications. For further information see
Appendix A.
SQLCODE
, the deprecated error code scheme, is a
simple integer. A value of 0 indicates success, a positive value
indicates success with additional information, a negative value
indicates an error. The SQL standard only defines the positive
value +100, which indicates that the last command returned or
affected zero rows, and no specific negative values. Therefore,
this scheme can only achieve poor portability and does not have a
hierarchical code assignment. Historically, the embedded SQL
processor for PostgreSQL has assigned
some specific SQLCODE
values for its use, which
are listed below with their numeric value and their symbolic name.
Remember that these are not portable to other SQL implementations.
To simplify the porting of applications to the
SQLSTATE
scheme, the corresponding
SQLSTATE
is also listed. There is, however, no
one-to-one or one-to-many mapping between the two schemes (indeed
it is many-to-many), so you should consult the global
SQLSTATE
listing in Appendix A
in each case.
These are the assigned SQLCODE
values:
ECPG_NO_ERROR
Indicates no error. (SQLSTATE 00000)
ECPG_NOT_FOUND
This is a harmless condition indicating that the last command retrieved or processed zero rows, or that you are at the end of the cursor. (SQLSTATE 02000)
When processing a cursor in a loop, you could use this code as a way to detect when to abort the loop, like this:
while (1) { EXEC SQL FETCH ... ; if (sqlca.sqlcode == ECPG_NOT_FOUND) break; }
But WHENEVER NOT FOUND DO BREAK
effectively
does this internally, so there is usually no advantage in
writing this out explicitly.
ECPG_OUT_OF_MEMORY
Indicates that your virtual memory is exhausted. The numeric
value is defined as -ENOMEM
. (SQLSTATE
YE001)
ECPG_UNSUPPORTED
Indicates the preprocessor has generated something that the library does not know about. Perhaps you are running incompatible versions of the preprocessor and the library. (SQLSTATE YE002)
ECPG_TOO_MANY_ARGUMENTS
This means that the command specified more host variables than the command expected. (SQLSTATE 07001 or 07002)
ECPG_TOO_FEW_ARGUMENTS
This means that the command specified fewer host variables than the command expected. (SQLSTATE 07001 or 07002)
ECPG_TOO_MANY_MATCHES
This means a query has returned multiple rows but the statement was only prepared to store one result row (for example, because the specified variables are not arrays). (SQLSTATE 21000)
ECPG_INT_FORMAT
The host variable is of type int
and the datum in
the database is of a different type and contains a value that
cannot be interpreted as an int
. The library uses
strtol()
for this conversion. (SQLSTATE
42804)
ECPG_UINT_FORMAT
The host variable is of type unsigned int
and the
datum in the database is of a different type and contains a
value that cannot be interpreted as an unsigned
int
. The library uses strtoul()
for this conversion. (SQLSTATE 42804)
ECPG_FLOAT_FORMAT
The host variable is of type float
and the datum
in the database is of another type and contains a value that
cannot be interpreted as a float
. The library
uses strtod()
for this conversion.
(SQLSTATE 42804)
ECPG_NUMERIC_FORMAT
The host variable is of type numeric
and the datum
in the database is of another type and contains a value that
cannot be interpreted as a numeric
value.
(SQLSTATE 42804)
ECPG_INTERVAL_FORMAT
The host variable is of type interval
and the datum
in the database is of another type and contains a value that
cannot be interpreted as an interval
value.
(SQLSTATE 42804)
ECPG_DATE_FORMAT
The host variable is of type date
and the datum in
the database is of another type and contains a value that
cannot be interpreted as a date
value.
(SQLSTATE 42804)
ECPG_TIMESTAMP_FORMAT
The host variable is of type timestamp
and the
datum in the database is of another type and contains a value
that cannot be interpreted as a timestamp
value.
(SQLSTATE 42804)
ECPG_CONVERT_BOOL
This means the host variable is of type bool
and
the datum in the database is neither 't'
nor
'f'
. (SQLSTATE 42804)
ECPG_EMPTY
The statement sent to the PostgreSQL server was empty. (This cannot normally happen in an embedded SQL program, so it might point to an internal error.) (SQLSTATE YE002)
ECPG_MISSING_INDICATOR
A null value was returned and no null indicator variable was supplied. (SQLSTATE 22002)
ECPG_NO_ARRAY
An ordinary variable was used in a place that requires an array. (SQLSTATE 42804)
ECPG_DATA_NOT_ARRAY
The database returned an ordinary variable in a place that requires an array value. (SQLSTATE 42804)
ECPG_ARRAY_INSERT
The value could not be inserted into the array. (SQLSTATE 42804)
ECPG_NO_CONN
The program tried to access a connection that does not exist. (SQLSTATE 08003)
ECPG_NOT_CONN
The program tried to access a connection that does exist but is not open. (This is an internal error.) (SQLSTATE YE002)
ECPG_INVALID_STMT
The statement you are trying to use has not been prepared. (SQLSTATE 26000)
ECPG_INFORMIX_DUPLICATE_KEY
Duplicate key error, violation of unique constraint (Informix compatibility mode). (SQLSTATE 23505)
ECPG_UNKNOWN_DESCRIPTOR
The descriptor specified was not found. The statement you are trying to use has not been prepared. (SQLSTATE 33000)
ECPG_INVALID_DESCRIPTOR_INDEX
The descriptor index specified was out of range. (SQLSTATE 07009)
ECPG_UNKNOWN_DESCRIPTOR_ITEM
An invalid descriptor item was requested. (This is an internal error.) (SQLSTATE YE002)
ECPG_VAR_NOT_NUMERIC
During the execution of a dynamic statement, the database returned a numeric value and the host variable was not numeric. (SQLSTATE 07006)
ECPG_VAR_NOT_CHAR
During the execution of a dynamic statement, the database returned a non-numeric value and the host variable was numeric. (SQLSTATE 07006)
ECPG_INFORMIX_SUBSELECT_NOT_ONE
The result of the subquery is not a single row (Informix compatibility mode). (SQLSTATE 21000)
ECPG_PGSQL
Some error caused by the PostgreSQL server. The message contains the error message from the PostgreSQL server.
ECPG_TRANS
The PostgreSQL server signaled that we cannot start, commit, or rollback the transaction. (SQLSTATE 08007)
ECPG_CONNECT
The connection attempt to the database did not succeed. (SQLSTATE 08001)
ECPG_DUPLICATE_KEY
Duplicate key error, violation of unique constraint. (SQLSTATE 23505)
ECPG_SUBSELECT_NOT_ONE
The result of the subquery is not a single row. (SQLSTATE 21000)
ECPG_WARNING_UNKNOWN_PORTAL
An invalid cursor name was specified. (SQLSTATE 34000)
ECPG_WARNING_IN_TRANSACTION
Transaction is in progress. (SQLSTATE 25001)
ECPG_WARNING_NO_TRANSACTION
There is no active (in-progress) transaction. (SQLSTATE 25P01)
ECPG_WARNING_PORTAL_EXISTS
An existing cursor name was specified. (SQLSTATE 42P03)
Several preprocessor directives are available that modify how
the ecpg
preprocessor parses and processes a
file.
To include an external file into your embedded SQL program, use:
EXEC SQL INCLUDE filename;
EXEC SQL INCLUDE <filename>;
EXEC SQL INCLUDE "filename";

The embedded SQL preprocessor will look for a file named filename.h, preprocess it, and include it in the resulting C output. Thus, embedded SQL statements in the included file are handled correctly.
The ecpg preprocessor will search for the file in several directories, in the following order:

/usr/local/include
/usr/local/pgsql/include
/usr/include

But when EXEC SQL INCLUDE "filename" is used, only the current directory is searched.
In each directory, the preprocessor will first look for the file
name as given, and if not found will append .h
to the file name and try again (unless the specified file name
already has that suffix).
Note that EXEC SQL INCLUDE
is not the same as:
#include <filename.h>
because this file would not be subject to SQL command preprocessing.
Naturally, you can continue to use the C
#include
directive to include other header
files.
The include file name is case-sensitive, even though the rest of
the EXEC SQL INCLUDE
command follows the normal
SQL case-sensitivity rules.
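For instance, a project might keep a shared declare section in its own file and pull it into each .pgc module with EXEC SQL INCLUDE. The following sketch is illustrative only: the file name common_decls.h, the host variables, and the customers table are assumptions rather than part of the examples above.

/* common_decls.h (hypothetical): shared host-variable declarations */
EXEC SQL BEGIN DECLARE SECTION;
int  customer_id;
char customer_name[64];
EXEC SQL END DECLARE SECTION;

/* main .pgc file: the preprocessor expands the include, so the
   host variables declared above are known to embedded SQL statements */
EXEC SQL INCLUDE "common_decls.h";

EXEC SQL SELECT name INTO :customer_name
    FROM customers WHERE id = :customer_id;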
Embedded SQL has a concept similar to the #define directive known from C:

EXEC SQL DEFINE name;
EXEC SQL DEFINE name value;
So you can define a name:
EXEC SQL DEFINE HAVE_FEATURE;
And you can also define constants:
EXEC SQL DEFINE MYNUMBER 12; EXEC SQL DEFINE MYSTRING 'abc';
Use undef
to remove a previous definition:
EXEC SQL UNDEF MYNUMBER;
Of course you can continue to use the C versions #define
and #undef
in your embedded SQL program. The difference
is where your defined values get evaluated. If you use EXEC SQL
DEFINE
then the ecpg
preprocessor evaluates the defines and substitutes
the values. For example if you write:
EXEC SQL DEFINE MYNUMBER 12; ... EXEC SQL UPDATE Tbl SET col = MYNUMBER;
then ecpg
will already do the substitution and your C compiler will never
see any name or identifier MYNUMBER
. Note that you cannot use
#define
for a constant that you are going to use in an
embedded SQL query because in this case the embedded SQL precompiler is not
able to see this declaration.
If multiple input files are named on the ecpg
preprocessor's command line, the effects of EXEC SQL
DEFINE
and EXEC SQL UNDEF
do not carry
across files: each file starts with only the symbols defined
by -D
switches on the command line.
You can use the following directives to compile code sections conditionally:
EXEC SQL ifdef name;

Checks name and processes subsequent lines if name has been defined via EXEC SQL define name.

EXEC SQL ifndef name;

Checks name and processes subsequent lines if name has not been defined via EXEC SQL define name.

EXEC SQL elif name;

Begins an optional alternative section after an EXEC SQL ifdef name or EXEC SQL ifndef name directive. Any number of elif sections can appear. Lines following an elif will be processed if name has been defined and no previous section of the same ifdef/ifndef...endif construct has been processed.

EXEC SQL else;

Begins an optional, final alternative section after an EXEC SQL ifdef name or EXEC SQL ifndef name directive. Subsequent lines will be processed if no previous section of the same ifdef/ifndef...endif construct has been processed.

EXEC SQL endif;

Ends an ifdef/ifndef...endif construct. Subsequent lines are processed normally.

ifdef/ifndef...endif constructs can be nested, up to 127 levels deep.
This example will compile exactly one of the three SET
TIMEZONE
commands:
EXEC SQL ifdef TZVAR;
EXEC SQL SET TIMEZONE TO TZVAR;
EXEC SQL elif TZNAME;
EXEC SQL SET TIMEZONE TO TZNAME;
EXEC SQL else;
EXEC SQL SET TIMEZONE TO 'GMT';
EXEC SQL endif;
Now that you have an idea how to form embedded SQL C programs, you probably want to know how to compile them. Before compiling you run the file through the embedded SQL C preprocessor, which converts the SQL statements you used to special function calls. After compiling, you must link with a special library that contains the needed functions. These functions fetch information from the arguments, perform the SQL command using the libpq interface, and put the result in the arguments specified for output.
The preprocessor program is called ecpg
and is
included in a normal PostgreSQL installation.
Embedded SQL programs are typically named with an extension
.pgc
. If you have a program file called
prog1.pgc
, you can preprocess it by simply
calling:
ecpg prog1.pgc
This will create a file called prog1.c
. If
your input files do not follow the suggested naming pattern, you
can specify the output file explicitly using the
-o
option.
The preprocessed file can be compiled normally, for example:
cc -c prog1.c
The generated C source files include header files from the
PostgreSQL installation, so if you installed
PostgreSQL in a location that is not searched by
default, you have to add an option such as
-I/usr/local/pgsql/include
to the compilation
command line.
To link an embedded SQL program, you need to include the
libecpg
library, like so:
cc -o myprog prog1.o prog2.o ... -lecpg
Again, you might have to add an option like
-L/usr/local/pgsql/lib
to that command line.
You can
use pg_config
or pkg-config
with package name libecpg
to
get the paths for your installation.
If you manage the build process of a larger project using make, it might be convenient to include the following implicit rule to your makefiles:
ECPG = ecpg

%.c: %.pgc
	$(ECPG) $<
The complete syntax of the ecpg
command is
detailed in ecpg.
The ecpg library is thread-safe by default. However, you might need to use some threading command-line options to compile your client code.
The libecpg
library primarily contains
“hidden” functions that are used to implement the
functionality expressed by the embedded SQL commands. But there
are some functions that can usefully be called directly. Note that
this makes your code unportable.
ECPGdebug(int on, FILE *stream) turns on debug logging if called with the first argument non-zero. Debug logging is done on stream. The log contains all SQL statements with all the input variables inserted, and the results from the PostgreSQL server. This can be very useful when searching for errors in your SQL statements.
On Windows, if the ecpg libraries and an application are compiled with different flags, this function call will crash the application because the internal representation of the FILE pointers differs. Specifically, multithreaded/single-threaded, release/debug, and static/dynamic flags should be the same for the library and all applications using that library.
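As a minimal sketch, a program could enable the debug log before its first SQL statement; logging to stderr is just one choice, any FILE pointer opened by the application would work as well:

#include <stdio.h>

int
main(void)
{
    /* turn on ECPG debug logging; all subsequent statements are logged */
    ECPGdebug(1, stderr);

    EXEC SQL CONNECT TO testdb;
    /* ... */
    EXEC SQL DISCONNECT ALL;
    return 0;
}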
ECPGget_PGconn(const char *connection_name) returns the library database connection handle identified by the given name. If connection_name is set to NULL, the current connection handle is returned. If no connection handle can be identified, the function returns NULL. The returned connection handle can be used to call any other functions from libpq, if necessary.
It is a bad idea to manipulate database connection handles made from ecpg directly with libpq routines.
ECPGtransactionStatus(const char *connection_name) returns the current transaction status of the connection identified by connection_name. See Section 34.2 and libpq's PQtransactionStatus for details about the returned status codes.
ECPGstatus(int lineno, const char *connection_name) returns true if you are connected to a database and false if not. connection_name can be NULL if a single connection is being used.
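A small sketch combining these calls; the connection name con1 is an assumption, the lineno argument is simply the source line (for example __LINE__), and the PQTRANS_* constants come from libpq-fe.h:

#include <stdio.h>
#include <libpq-fe.h>

void
report_connection_state(void)
{
    /* is the named connection open at all? */
    if (!ECPGstatus(__LINE__, "con1"))
    {
        printf("con1 is not connected\n");
        return;
    }

    /* ECPGtransactionStatus returns the same codes as PQtransactionStatus */
    if (ECPGtransactionStatus("con1") == PQTRANS_IDLE)
        printf("con1 is idle, no transaction in progress\n");
    else
        printf("con1 is in a transaction or in an error state\n");
}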
Large objects are not directly supported by ECPG, but an ECPG application can manipulate large objects through the libpq large object functions, obtaining the necessary PGconn
object by calling the ECPGget_PGconn()
function. (However, use of
the ECPGget_PGconn()
function and touching
PGconn
objects directly should be done very carefully
and ideally not mixed with other ECPG database access calls.)
For more details about ECPGget_PGconn()
, see
Section 36.11. For information about the large
object function interface, see Chapter 35.
Large object functions have to be called in a transaction block, so
when autocommit is off, BEGIN
commands have to
be issued explicitly.
Example 36.2 shows an example program that illustrates how to create, write, and read a large object in an ECPG application.
Example 36.2. ECPG Program Accessing Large Objects
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>

EXEC SQL WHENEVER SQLERROR STOP;

int
main(void)
{
    PGconn     *conn;
    Oid         loid;
    int         fd;
    char        buf[256];
    int         buflen = 256;
    char        buf2[256];
    int         rc;

    memset(buf, 1, buflen);

    EXEC SQL CONNECT TO testdb AS con1;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;

    conn = ECPGget_PGconn("con1");
    printf("conn = %p\n", conn);

    /* create */
    loid = lo_create(conn, 0);
    if (loid < 0)
        printf("lo_create() failed: %s", PQerrorMessage(conn));

    printf("loid = %d\n", loid);

    /* write test */
    fd = lo_open(conn, loid, INV_READ|INV_WRITE);
    if (fd < 0)
        printf("lo_open() failed: %s", PQerrorMessage(conn));

    printf("fd = %d\n", fd);

    rc = lo_write(conn, fd, buf, buflen);
    if (rc < 0)
        printf("lo_write() failed\n");

    rc = lo_close(conn, fd);
    if (rc < 0)
        printf("lo_close() failed: %s", PQerrorMessage(conn));

    /* read test */
    fd = lo_open(conn, loid, INV_READ);
    if (fd < 0)
        printf("lo_open() failed: %s", PQerrorMessage(conn));

    printf("fd = %d\n", fd);

    rc = lo_read(conn, fd, buf2, buflen);
    if (rc < 0)
        printf("lo_read() failed\n");

    rc = lo_close(conn, fd);
    if (rc < 0)
        printf("lo_close() failed: %s", PQerrorMessage(conn));

    /* check */
    rc = memcmp(buf, buf2, buflen);
    printf("memcmp() = %d\n", rc);

    /* cleanup */
    rc = lo_unlink(conn, loid);
    if (rc < 0)
        printf("lo_unlink() failed: %s", PQerrorMessage(conn));

    EXEC SQL COMMIT;
    EXEC SQL DISCONNECT ALL;

    return 0;
}
ECPG has some limited support for C++ applications. This section describes some caveats.
The ecpg
preprocessor takes an input file
written in C (or something like C) and embedded SQL commands,
converts the embedded SQL commands into C language chunks, and
finally generates a .c
file. The header file
declarations of the library functions used by the C language chunks
that ecpg
generates are wrapped
in extern "C" { ... }
blocks when used under
C++, so they should work seamlessly in C++.
In general, however, the ecpg
preprocessor only
understands C; it does not handle the special syntax and reserved
words of the C++ language. So, some embedded SQL code written in
C++ application code that uses complicated features specific to C++
might fail to be preprocessed correctly or might not work as
expected.
A safe way to use the embedded SQL code in a C++ application is hiding the ECPG calls in a C module, which the C++ application code calls into to access the database, and linking that together with the rest of the C++ code. See Section 36.13.2 about that.
The ecpg
preprocessor understands the scope of
variables in C. In the C language, this is rather simple because the scope of a variable is based on its code block. In C++,
however, the class member variables are referenced in a different
code block from the declared position, so
the ecpg
preprocessor will not understand the
scope of the class member variables.
For example, in the following case, the ecpg
preprocessor cannot find any declaration for the
variable dbname
in the test
method, so an error will occur.
class TestCpp
{
    EXEC SQL BEGIN DECLARE SECTION;
    char dbname[1024];
    EXEC SQL END DECLARE SECTION;

  public:
    TestCpp();
    void test();
    ~TestCpp();
};

TestCpp::TestCpp()
{
    EXEC SQL CONNECT TO testdb1;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;
}

void TestCpp::test()
{
    EXEC SQL SELECT current_database() INTO :dbname;
    printf("current_database = %s\n", dbname);
}

TestCpp::~TestCpp()
{
    EXEC SQL DISCONNECT ALL;
}
This code will result in an error like this:
ecpg test_cpp.pgc
test_cpp.pgc:28: ERROR: variable "dbname" is not declared
To avoid this scope issue, the test
method
could be modified to use a local variable as intermediate storage.
But this approach is only a poor workaround, because it uglifies
the code and reduces performance.
void TestCpp::test()
{
    EXEC SQL BEGIN DECLARE SECTION;
    char tmp[1024];
    EXEC SQL END DECLARE SECTION;

    EXEC SQL SELECT current_database() INTO :tmp;
    strlcpy(dbname, tmp, sizeof(tmp));

    printf("current_database = %s\n", dbname);
}
If you understand these technical limitations of
the ecpg
preprocessor in C++, you might come to
the conclusion that linking C objects and C++ objects at the link
stage to enable C++ applications to use ECPG features could be
better than writing some embedded SQL commands in C++ code
directly. This section describes a way to separate some embedded
SQL commands from C++ application code with a simple example. In
this example, the application is implemented in C++, while C and
ECPG is used to connect to the PostgreSQL server.
Three kinds of files have to be created: a C file
(*.pgc
), a header file, and a C++ file:
test_mod.pgc
A sub-routine module to execute SQL commands embedded in C.
It is going to be converted
into test_mod.c
by the preprocessor.
#include "test_mod.h" #include <stdio.h> void db_connect() { EXEC SQL CONNECT TO testdb1; EXEC SQL SELECT pg_catalog.set_config('search_path', '', false); EXEC SQL COMMIT; } void db_test() { EXEC SQL BEGIN DECLARE SECTION; char dbname[1024]; EXEC SQL END DECLARE SECTION; EXEC SQL SELECT current_database() INTO :dbname; printf("current_database = %s\n", dbname); } void db_disconnect() { EXEC SQL DISCONNECT ALL; }
test_mod.h
A header file with declarations of the functions in the C
module (test_mod.pgc
). It is included by
test_cpp.cpp
. This file has to have an
extern "C"
block around the declarations,
because it will be linked from the C++ module.
#ifdef __cplusplus
extern "C" {
#endif

void db_connect();
void db_test();
void db_disconnect();

#ifdef __cplusplus
}
#endif
test_cpp.cpp
The main code for the application, including
the main
routine, and in this example a
C++ class.
#include "test_mod.h" class TestCpp { public: TestCpp(); void test(); ~TestCpp(); }; TestCpp::TestCpp() { db_connect(); } void TestCpp::test() { db_test(); } TestCpp::~TestCpp() { db_disconnect(); } int main(void) { TestCpp *t = new TestCpp(); t->test(); return 0; }
To build the application, proceed as follows. Convert
test_mod.pgc
into test_mod.c
by
running ecpg
, and generate
test_mod.o
by compiling
test_mod.c
with the C compiler:
ecpg -o test_mod.c test_mod.pgc
cc -c test_mod.c -o test_mod.o
Next, generate test_cpp.o
by compiling
test_cpp.cpp
with the C++ compiler:
c++ -c test_cpp.cpp -o test_cpp.o
Finally, link these object files, test_cpp.o
and test_mod.o
, into one executable, using the C++
compiler driver:
c++ test_cpp.o test_mod.o -lecpg -o test_cpp
This section describes all SQL commands that are specific to embedded SQL. Also refer to the SQL commands listed in SQL Commands, which can also be used in embedded SQL, unless stated otherwise.
ALLOCATE DESCRIPTOR — allocate an SQL descriptor area
ALLOCATE DESCRIPTOR name
ALLOCATE DESCRIPTOR
allocates a new named SQL
descriptor area, which can be used to exchange data between the
PostgreSQL server and the host program.
Descriptor areas should be freed after use using
the DEALLOCATE DESCRIPTOR
command.
name
The name of an SQL descriptor; it is case sensitive. This can be an SQL identifier or a host variable.
EXEC SQL ALLOCATE DESCRIPTOR mydesc;
ALLOCATE DESCRIPTOR
is specified in the SQL
standard.
CONNECT — establish a database connection
CONNECT TO connection_target [ AS connection_name ] [ USER connection_user ]
CONNECT TO DEFAULT
CONNECT connection_user
DATABASE connection_target
The CONNECT
command establishes a connection
between the client and the PostgreSQL server.
connection_target

connection_target specifies the target server of the connection in one of several forms.

[ database_name ] [ @host ] [ :port ]
Connect over TCP/IP

unix:postgresql://host [ :port ] / [ database_name ] [ ?connection_option ]
Connect over Unix-domain sockets

tcp:postgresql://host [ :port ] / [ database_name ] [ ?connection_option ]
Connect over TCP/IP

SQL string constant
containing a value in one of the above forms

host variable
host variable of type char[] or VARCHAR[] containing a value in one of the above forms
connection_name
An optional identifier for the connection, so that it can be referred to in other commands. This can be an SQL identifier or a host variable.
connection_user
The user name for the database connection.
This parameter can also specify a user name and password, using one of the forms user_name/password, user_name IDENTIFIED BY password, or user_name USING password.
User name and password can be SQL identifiers, string constants, or host variables.
DEFAULT
Use all default connection parameters, as defined by libpq.
Here are several variants for specifying connection parameters:
EXEC SQL CONNECT TO "connectdb" AS main; EXEC SQL CONNECT TO "connectdb" AS second; EXEC SQL CONNECT TO "unix:postgresql://200.46.204.71/connectdb" AS main USER connectuser; EXEC SQL CONNECT TO "unix:postgresql://localhost/connectdb" AS main USER connectuser; EXEC SQL CONNECT TO 'connectdb' AS main; EXEC SQL CONNECT TO 'unix:postgresql://localhost/connectdb' AS main USER :user; EXEC SQL CONNECT TO :db AS :id; EXEC SQL CONNECT TO :db USER connectuser USING :pw; EXEC SQL CONNECT TO @localhost AS main USER connectdb; EXEC SQL CONNECT TO REGRESSDB1 as main; EXEC SQL CONNECT TO AS main USER connectdb; EXEC SQL CONNECT TO connectdb AS :id; EXEC SQL CONNECT TO connectdb AS main USER connectuser/connectdb; EXEC SQL CONNECT TO connectdb AS main; EXEC SQL CONNECT TO connectdb@localhost AS main; EXEC SQL CONNECT TO tcp:postgresql://localhost/ USER connectdb; EXEC SQL CONNECT TO tcp:postgresql://localhost/connectdb USER connectuser IDENTIFIED BY connectpw; EXEC SQL CONNECT TO tcp:postgresql://localhost:20/connectdb USER connectuser IDENTIFIED BY connectpw; EXEC SQL CONNECT TO unix:postgresql://localhost/ AS main USER connectdb; EXEC SQL CONNECT TO unix:postgresql://localhost/connectdb AS main USER connectuser; EXEC SQL CONNECT TO unix:postgresql://localhost/connectdb USER connectuser IDENTIFIED BY "connectpw"; EXEC SQL CONNECT TO unix:postgresql://localhost/connectdb USER connectuser USING "connectpw"; EXEC SQL CONNECT TO unix:postgresql://localhost/connectdb?connect_timeout=14 USER connectuser;
Here is an example program that illustrates the use of host variables to specify connection parameters:
int
main(void)
{
    EXEC SQL BEGIN DECLARE SECTION;
    char *dbname     = "testdb";    /* database name */
    char *user       = "testuser";  /* connection user name */
    char *connection = "tcp:postgresql://localhost:5432/testdb";
                                    /* connection string */
    char ver[256];                  /* buffer to store the version string */
    EXEC SQL END DECLARE SECTION;

    ECPGdebug(1, stderr);

    EXEC SQL CONNECT TO :dbname USER :user;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;
    EXEC SQL SELECT version() INTO :ver;
    EXEC SQL DISCONNECT;

    printf("version: %s\n", ver);

    EXEC SQL CONNECT TO :connection USER :user;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;
    EXEC SQL SELECT version() INTO :ver;
    EXEC SQL DISCONNECT;

    printf("version: %s\n", ver);

    return 0;
}
CONNECT
is specified in the SQL standard, but
the format of the connection parameters is
implementation-specific.
DEALLOCATE DESCRIPTOR — deallocate an SQL descriptor area
DEALLOCATE DESCRIPTOR name
DEALLOCATE DESCRIPTOR
deallocates a named SQL
descriptor area.
name
The name of the descriptor which is going to be deallocated. It is case sensitive. This can be an SQL identifier or a host variable.
EXEC SQL DEALLOCATE DESCRIPTOR mydesc;
DEALLOCATE DESCRIPTOR
is specified in the SQL
standard.
DECLARE — define a cursor
DECLARE cursor_name [ BINARY ] [ ASENSITIVE | INSENSITIVE ] [ [ NO ] SCROLL ] CURSOR [ { WITH | WITHOUT } HOLD ] FOR prepared_name
DECLARE cursor_name [ BINARY ] [ ASENSITIVE | INSENSITIVE ] [ [ NO ] SCROLL ] CURSOR [ { WITH | WITHOUT } HOLD ] FOR query
DECLARE
declares a cursor for iterating over
the result set of a prepared statement. This command has
slightly different semantics from the direct SQL
command DECLARE
: Whereas the latter executes a
query and prepares the result set for retrieval, this embedded
SQL command merely declares a name as a “loop
variable” for iterating over the result set of a query;
the actual execution happens when the cursor is opened with
the OPEN
command.
Examples declaring a cursor for a query:
EXEC SQL DECLARE C CURSOR FOR SELECT * FROM My_Table;
EXEC SQL DECLARE C CURSOR FOR SELECT Item1 FROM T;
EXEC SQL DECLARE cur1 CURSOR FOR SELECT version();
An example declaring a cursor for a prepared statement:
EXEC SQL PREPARE stmt1 AS SELECT version();
EXEC SQL DECLARE cur1 CURSOR FOR stmt1;
DECLARE
is specified in the SQL standard.
DECLARE STATEMENT — declare SQL statement identifier
EXEC SQL [ AT connection_name ] DECLARE statement_name STATEMENT
DECLARE STATEMENT
declares an SQL statement identifier. An SQL statement identifier can be associated with a connection. When the identifier is used by dynamic SQL statements, the statements are executed using the associated connection. The namespace of the declaration is the precompile unit, and multiple declarations of the same SQL statement identifier are not allowed. Note that if the precompiler runs in Informix compatibility mode and some SQL statement is declared, "database" cannot be used as a cursor name.
connection_name
A database connection name established by the CONNECT
command.
The AT clause can be omitted, but such a statement has no meaning.
statement_name
The name of an SQL statement identifier, either as an SQL identifier or a host variable.
This association is valid only if the declaration is physically placed on top of a dynamic statement.
EXEC SQL CONNECT TO postgres AS con1;
EXEC SQL AT con1 DECLARE sql_stmt STATEMENT;
EXEC SQL DECLARE cursor_name CURSOR FOR sql_stmt;
EXEC SQL PREPARE sql_stmt FROM :dyn_string;
EXEC SQL OPEN cursor_name;
EXEC SQL FETCH cursor_name INTO :column1;
EXEC SQL CLOSE cursor_name;
DECLARE STATEMENT
is an extension of the SQL standard, but is also available in other well-known DBMSs.
DESCRIBE — obtain information about a prepared statement or result set
DESCRIBE [ OUTPUT ] prepared_name USING [ SQL ] DESCRIPTOR descriptor_name
DESCRIBE [ OUTPUT ] prepared_name INTO [ SQL ] DESCRIPTOR descriptor_name
DESCRIBE [ OUTPUT ] prepared_name INTO sqlda_name
DESCRIBE
retrieves metadata information about
the result columns contained in a prepared statement, without
actually fetching a row.
prepared_name
The name of a prepared statement. This can be an SQL identifier or a host variable.
descriptor_name
A descriptor name. It is case sensitive. It can be an SQL identifier or a host variable.
sqlda_name
The name of an SQLDA variable.
EXEC SQL ALLOCATE DESCRIPTOR mydesc;
EXEC SQL PREPARE stmt1 FROM :sql_stmt;
EXEC SQL DESCRIBE stmt1 INTO SQL DESCRIPTOR mydesc;
EXEC SQL GET DESCRIPTOR mydesc VALUE 1 :charvar = NAME;
EXEC SQL DEALLOCATE DESCRIPTOR mydesc;
DESCRIBE
is specified in the SQL standard.
DISCONNECT — terminate a database connection
DISCONNECT connection_name
DISCONNECT [ CURRENT ]
DISCONNECT ALL
DISCONNECT
closes a connection (or all
connections) to the database.
connection_name
A database connection name established by
the CONNECT
command.
CURRENT
Close the “current” connection, which is either
the most recently opened connection, or the connection set by
the SET CONNECTION
command. This is also
the default if no argument is given to
the DISCONNECT
command.
ALL
Close all open connections.
int
main(void)
{
    EXEC SQL CONNECT TO testdb AS con1 USER testuser;
    EXEC SQL CONNECT TO testdb AS con2 USER testuser;
    EXEC SQL CONNECT TO testdb AS con3 USER testuser;

    EXEC SQL DISCONNECT CURRENT;  /* close con3          */
    EXEC SQL DISCONNECT ALL;      /* close con2 and con1 */

    return 0;
}
DISCONNECT
is specified in the SQL standard.
EXECUTE IMMEDIATE — dynamically prepare and execute a statement
EXECUTE IMMEDIATE string
EXECUTE IMMEDIATE
immediately prepares and
executes a dynamically specified SQL statement, without
retrieving result rows.
string
A literal string or a host variable containing the SQL statement to be executed.
In typical usage, the string
is a host
variable reference to a string containing a dynamically-constructed
SQL statement. The case of a literal string is not very useful;
you might as well just write the SQL statement directly, without
the extra typing of EXECUTE IMMEDIATE
.
If you do use a literal string, keep in mind that any double quotes
you might wish to include in the SQL statement must be written as
octal escapes (\042
) not the usual C
idiom \"
. This is because the string is inside
an EXEC SQL
section, so the ECPG lexer parses it
according to SQL rules not C rules. Any embedded backslashes will
later be handled according to C rules; but \"
causes an immediate syntax error because it is seen as ending the
literal.
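For illustration only, a literal-string sketch that quotes an identifier; the table name "My Table" is hypothetical, and the double quotes are written as \042 for the reason just described:

EXEC SQL EXECUTE IMMEDIATE "CREATE TABLE \042My Table\042 (id integer)";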
Here is an example that executes an INSERT
statement using EXECUTE IMMEDIATE
and a host
variable named command
:
sprintf(command, "INSERT INTO test (name, amount, letter) VALUES ('db: ''r1''', 1, 'f')");
EXEC SQL EXECUTE IMMEDIATE :command;
EXECUTE IMMEDIATE
is specified in the SQL standard.
GET DESCRIPTOR — get information from an SQL descriptor area
GET DESCRIPTOR descriptor_name :cvariable = descriptor_header_item [, ... ]
GET DESCRIPTOR descriptor_name VALUE column_number :cvariable = descriptor_item [, ... ]
GET DESCRIPTOR
retrieves information about a
query result set from an SQL descriptor area and stores it into
host variables. A descriptor area is typically populated
using FETCH
or SELECT
before using this command to transfer the information into host
language variables.
This command has two forms: The first form retrieves descriptor “header” items, which apply to the result set in its entirety. One example is the row count. The second form, which requires the column number as additional parameter, retrieves information about a particular column. Examples are the column name and the actual column value.
descriptor_name
A descriptor name.
descriptor_header_item
A token identifying which header information item to retrieve.
Only COUNT
, to get the number of columns in the
result set, is currently supported.
column_number
The number of the column about which information is to be retrieved. The count starts at 1.
descriptor_item
A token identifying which item of information about a column to retrieve. See Section 36.7.1 for a list of supported items.
cvariable
A host variable that will receive the data retrieved from the descriptor area.
An example to retrieve the number of columns in a result set:
EXEC SQL GET DESCRIPTOR d :d_count = COUNT;
An example to retrieve a data length in the first column:
EXEC SQL GET DESCRIPTOR d VALUE 1 :d_returned_octet_length = RETURNED_OCTET_LENGTH;
An example to retrieve the data body of the second column as a string:
EXEC SQL GET DESCRIPTOR d VALUE 2 :d_data = DATA;
Here is an example for a whole procedure of
executing SELECT current_database();
and showing the number of
columns, the column data length, and the column data:
int
main(void)
{
    EXEC SQL BEGIN DECLARE SECTION;
    int  d_count;
    char d_data[1024];
    int  d_returned_octet_length;
    EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO testdb AS con1 USER testuser;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;
    EXEC SQL ALLOCATE DESCRIPTOR d;

    /* Declare, open a cursor, and assign a descriptor to the cursor */
    EXEC SQL DECLARE cur CURSOR FOR SELECT current_database();
    EXEC SQL OPEN cur;
    EXEC SQL FETCH NEXT FROM cur INTO SQL DESCRIPTOR d;

    /* Get a number of total columns */
    EXEC SQL GET DESCRIPTOR d :d_count = COUNT;
    printf("d_count                 = %d\n", d_count);

    /* Get length of a returned column */
    EXEC SQL GET DESCRIPTOR d VALUE 1 :d_returned_octet_length = RETURNED_OCTET_LENGTH;
    printf("d_returned_octet_length = %d\n", d_returned_octet_length);

    /* Fetch the returned column as a string */
    EXEC SQL GET DESCRIPTOR d VALUE 1 :d_data = DATA;
    printf("d_data                  = %s\n", d_data);

    /* Closing */
    EXEC SQL CLOSE cur;
    EXEC SQL COMMIT;

    EXEC SQL DEALLOCATE DESCRIPTOR d;
    EXEC SQL DISCONNECT ALL;

    return 0;
}
When the example is executed, the result will look like this:
d_count                 = 1
d_returned_octet_length = 6
d_data                  = testdb
GET DESCRIPTOR
is specified in the SQL standard.
OPEN — open a dynamic cursor
OPEN cursor_name
OPEN cursor_name USING value [, ... ]
OPEN cursor_name USING SQL DESCRIPTOR descriptor_name
OPEN
opens a cursor and optionally binds
actual values to the placeholders in the cursor's declaration.
The cursor must previously have been declared with
the DECLARE
command. The execution
of OPEN
causes the query to start executing on
the server.
cursor_name
The name of the cursor to be opened. This can be an SQL identifier or a host variable.
value
A value to be bound to a placeholder in the cursor. This can be an SQL constant, a host variable, or a host variable with indicator.
descriptor_name
The name of a descriptor containing values to be bound to the placeholders in the cursor. This can be an SQL identifier or a host variable.
EXEC SQL OPEN a;
EXEC SQL OPEN d USING 1, 'test';
EXEC SQL OPEN c1 USING SQL DESCRIPTOR mydesc;
EXEC SQL OPEN :curname1;
OPEN
is specified in the SQL standard.
PREPARE — prepare a statement for execution
PREPARE prepared_name FROM string
PREPARE
prepares a statement dynamically
specified as a string for execution. This is different from the
direct SQL statement PREPARE, which can also
be used in embedded programs. The EXECUTE
command is used to execute either kind of prepared statement.
prepared_name
An identifier for the prepared query.
string
A literal string or a host variable containing a preparable
SQL statement, one of SELECT, INSERT, UPDATE, or DELETE.
Use question marks (?
) for parameter values
to be supplied at execution.
In typical usage, the string
is a host
variable reference to a string containing a dynamically-constructed
SQL statement. The case of a literal string is not very useful;
you might as well just write a direct SQL PREPARE
statement.
If you do use a literal string, keep in mind that any double quotes
you might wish to include in the SQL statement must be written as
octal escapes (\042
) not the usual C
idiom \"
. This is because the string is inside
an EXEC SQL
section, so the ECPG lexer parses it
according to SQL rules not C rules. Any embedded backslashes will
later be handled according to C rules; but \"
causes an immediate syntax error because it is seen as ending the
literal.
char *stmt = "SELECT * FROM test1 WHERE a = ? AND b = ?"; EXEC SQL ALLOCATE DESCRIPTOR outdesc; EXEC SQL PREPARE foo FROM :stmt; EXEC SQL EXECUTE foo USING SQL DESCRIPTOR indesc INTO SQL DESCRIPTOR outdesc;
PREPARE
is specified in the SQL standard.
SET AUTOCOMMIT — set the autocommit behavior of the current session
SET AUTOCOMMIT { = | TO } { ON | OFF }
SET AUTOCOMMIT
sets the autocommit behavior of
the current database session. By default, embedded SQL programs
are not in autocommit mode,
so COMMIT
needs to be issued explicitly when
desired. This command can change the session to autocommit mode,
where each individual statement is committed implicitly.
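As a brief illustration of the syntax shown above:

EXEC SQL SET AUTOCOMMIT = ON;   /* commit each statement implicitly */
EXEC SQL SET AUTOCOMMIT TO OFF; /* back to explicit COMMIT */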
SET AUTOCOMMIT
is an extension of PostgreSQL ECPG.
SET CONNECTION — select a database connection
SET CONNECTION [ TO | = ] connection_name
SET CONNECTION
sets the “current”
database connection, which is the one that all commands use
unless overridden.
connection_name
A database connection name established by
the CONNECT
command.
CURRENT
Set the connection to the current connection (thus, nothing happens).
EXEC SQL SET CONNECTION TO con2;
EXEC SQL SET CONNECTION = con1;
SET CONNECTION
is specified in the SQL standard.
SET DESCRIPTOR — set information in an SQL descriptor area
SET DESCRIPTOR descriptor_name descriptor_header_item = value [, ... ]
SET DESCRIPTOR descriptor_name VALUE number descriptor_item = value [, ... ]
SET DESCRIPTOR
populates an SQL descriptor
area with values. The descriptor area is then typically used to
bind parameters in a prepared query execution.
This command has two forms: The first form applies to the descriptor “header”, which is independent of a particular datum. The second form assigns values to particular datums, identified by number.
descriptor_name
A descriptor name.
descriptor_header_item
A token identifying which header information item to set.
Only COUNT
, to set the number of descriptor
items, is currently supported.
number
The number of the descriptor item to set. The count starts at 1.
descriptor_item
A token identifying which item of information to set in the descriptor. See Section 36.7.1 for a list of supported items.
value
A value to store into the descriptor item. This can be an SQL constant or a host variable.
EXEC SQL SET DESCRIPTOR indesc COUNT = 1;
EXEC SQL SET DESCRIPTOR indesc VALUE 1 DATA = 2;
EXEC SQL SET DESCRIPTOR indesc VALUE 1 DATA = :val1;
EXEC SQL SET DESCRIPTOR indesc VALUE 2 INDICATOR = :val1, DATA = 'some string';
EXEC SQL SET DESCRIPTOR indesc VALUE 2 INDICATOR = :val2null, DATA = :val2;
SET DESCRIPTOR
is specified in the SQL standard.
TYPE — define a new data type
TYPE type_name IS ctype
The TYPE
command defines a new C type. It is
equivalent to putting a typedef
into a declare
section.
This command is only recognized when ecpg
is
run with the -c
option.
type_name
The name for the new type. It must be a valid C type name.
ctype
A C type specification.
EXEC SQL TYPE customer IS
    struct
    {
        varchar name[50];
        int     phone;
    };

EXEC SQL TYPE cust_ind IS
    struct ind
    {
        short   name_ind;
        short   phone_ind;
    };

EXEC SQL TYPE c IS char reference;
EXEC SQL TYPE ind IS union { int integer; short smallint; };
EXEC SQL TYPE intarray IS int[AMOUNT];
EXEC SQL TYPE str IS varchar[BUFFERSIZ];
EXEC SQL TYPE string IS char[11];
Here is an example program that uses EXEC SQL
TYPE
:
EXEC SQL WHENEVER SQLERROR SQLPRINT;

EXEC SQL TYPE tt IS
    struct
    {
        varchar v[256];
        int     i;
    };

EXEC SQL TYPE tt_ind IS
    struct ind
    {
        short   v_ind;
        short   i_ind;
    };

int
main(void)
{
    EXEC SQL BEGIN DECLARE SECTION;
    tt t;
    tt_ind t_ind;
    EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO testdb AS con1;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;

    EXEC SQL SELECT current_database(), 256 INTO :t:t_ind LIMIT 1;

    printf("t.v = %s\n", t.v.arr);
    printf("t.i = %d\n", t.i);

    printf("t_ind.v_ind = %d\n", t_ind.v_ind);
    printf("t_ind.i_ind = %d\n", t_ind.i_ind);

    EXEC SQL DISCONNECT con1;

    return 0;
}
The output from this program looks like this:
t.v = testdb
t.i = 256
t_ind.v_ind = 0
t_ind.i_ind = 0
The TYPE
command is a PostgreSQL extension.
VAR — define a variable
VAR varname IS ctype
The VAR
command assigns a new C data type
to a host variable. The host variable must be previously
declared in a declare section.
varname
A C variable name.
ctype
A C type specification.
Exec sql begin declare section;
short a;
exec sql end declare section;
EXEC SQL VAR a IS int;
The VAR
command is a PostgreSQL extension.
WHENEVER — specify the action to be taken when an SQL statement causes a specific class condition to be raised
WHENEVER { NOT FOUND | SQLERROR | SQLWARNING } action
Defines a behavior that is invoked in special cases (no rows found, SQL warnings, or errors) in the result of SQL execution.
See Section 36.8.1 for a description of the parameters.
EXEC SQL WHENEVER NOT FOUND CONTINUE;
EXEC SQL WHENEVER NOT FOUND DO BREAK;
EXEC SQL WHENEVER NOT FOUND DO CONTINUE;
EXEC SQL WHENEVER SQLWARNING SQLPRINT;
EXEC SQL WHENEVER SQLWARNING DO warn();
EXEC SQL WHENEVER SQLERROR sqlprint;
EXEC SQL WHENEVER SQLERROR CALL print2();
EXEC SQL WHENEVER SQLERROR DO handle_error("select");
EXEC SQL WHENEVER SQLERROR DO sqlnotice(NULL, NONO);
EXEC SQL WHENEVER SQLERROR DO sqlprint();
EXEC SQL WHENEVER SQLERROR GOTO error_label;
EXEC SQL WHENEVER SQLERROR STOP;
A typical application is the use of WHENEVER NOT FOUND DO BREAK to handle looping through result sets:
int
main(void)
{
    EXEC SQL CONNECT TO testdb AS con1;
    EXEC SQL SELECT pg_catalog.set_config('search_path', '', false);
    EXEC SQL COMMIT;
    EXEC SQL ALLOCATE DESCRIPTOR d;
    EXEC SQL DECLARE cur CURSOR FOR SELECT current_database(), 'hoge', 256;
    EXEC SQL OPEN cur;

    /* when end of result set reached, break out of while loop */
    EXEC SQL WHENEVER NOT FOUND DO BREAK;

    while (1)
    {
        EXEC SQL FETCH NEXT FROM cur INTO SQL DESCRIPTOR d;
        ...
    }

    EXEC SQL CLOSE cur;
    EXEC SQL COMMIT;

    EXEC SQL DEALLOCATE DESCRIPTOR d;
    EXEC SQL DISCONNECT ALL;

    return 0;
}
WHENEVER
is specified in the SQL standard, but
most of the actions are PostgreSQL extensions.
ecpg
can be run in a so-called Informix compatibility mode. If
this mode is active, it tries to behave as if it were the Informix
precompiler for Informix E/SQL. Generally speaking, this allows you to use
the dollar sign instead of the EXEC SQL
primitive to introduce
embedded SQL commands:
$int j = 3;
$CONNECT TO :dbname;
$CREATE TABLE test(i INT PRIMARY KEY, j INT);
$INSERT INTO test(i, j) VALUES (7, :j);
$COMMIT;
There must not be any white space between the $
and a following preprocessor directive, that is,
include
, define
, ifdef
,
etc. Otherwise, the preprocessor will parse the token as a host
variable.
There are two compatibility modes: INFORMIX and INFORMIX_SE. When linking programs that use this compatibility mode, remember to link against libcompat, which is shipped with ECPG.
Besides the previously explained syntactic sugar, the Informix compatibility mode ports some functions for input, output and transformation of data as well as embedded SQL statements known from E/SQL to ECPG.
Informix compatibility mode is closely connected to the pgtypeslib library
of ECPG. pgtypeslib maps SQL data types to data types within the C host
program and most of the additional functions of the Informix compatibility
mode allow you to operate on those C host program types. Note however that
the extent of the compatibility is limited. It does not try to copy Informix
behavior; it allows you to do more or less the same operations and gives
you functions that have the same name and the same basic behavior but it is
no drop-in replacement if you are using Informix at the moment. Moreover,
some of the data types are different. For example,
PostgreSQL's datetime and interval types do not
know about ranges like for example YEAR TO MINUTE
so you won't
find support in ECPG for that either.
The Informix-special "string" pseudo-type for storing right-trimmed character string data is now
supported in Informix-mode without using typedef
. In fact, in Informix-mode,
ECPG refuses to process source files that contain typedef sometype string;
EXEC SQL BEGIN DECLARE SECTION;
string userid; /* this variable will contain trimmed data */
EXEC SQL END DECLARE SECTION;

EXEC SQL FETCH MYCUR INTO :userid;
CLOSE DATABASE
This statement closes the current connection. In fact, this is a
synonym for ECPG's DISCONNECT CURRENT
:
$CLOSE DATABASE;                /* close the current connection */
EXEC SQL CLOSE DATABASE;
FREE cursor_name
Due to differences in how ECPG works compared to Informix's ESQL/C (i.e., which steps are purely grammar transformations and which steps rely on the underlying run-time library), there is no FREE cursor_name statement in ECPG. This is because in ECPG, DECLARE CURSOR doesn't translate to a function call into the run-time library that uses the cursor name. This means that there's no run-time bookkeeping of SQL cursors in the ECPG run-time library, only in the PostgreSQL server.
FREE statement_name
FREE statement_name
is a synonym for DEALLOCATE PREPARE statement_name
.
Informix-compatible mode supports a different structure than the one described in Section 36.7.2. See below:
struct sqlvar_compat
{
    short   sqltype;
    int     sqllen;
    char   *sqldata;
    short  *sqlind;
    char   *sqlname;
    char   *sqlformat;
    short   sqlitype;
    short   sqlilen;
    char   *sqlidata;
    int     sqlxid;
    char   *sqltypename;
    short   sqltypelen;
    short   sqlownerlen;
    short   sqlsourcetype;
    char   *sqlownername;
    int     sqlsourceid;
    char   *sqlilongdata;
    int     sqlflags;
    void   *sqlreserved;
};

struct sqlda_compat
{
    short                 sqld;
    struct sqlvar_compat *sqlvar;
    char                  desc_name[19];
    short                 desc_occ;
    struct sqlda_compat  *desc_next;
    void                 *reserved;
};

typedef struct sqlvar_compat sqlvar_t;
typedef struct sqlda_compat  sqlda_t;
The global properties are:
sqld
The number of fields in the SQLDA
descriptor.
sqlvar
Pointer to the per-field properties.
desc_name
Unused, filled with zero-bytes.
desc_occ
Size of the allocated structure.
desc_next
Pointer to the next SQLDA structure if the result set contains more than one record.
reserved
Unused pointer, contains NULL. Kept for Informix-compatibility.
The per-field properties are described below; they are stored in the sqlvar array:
sqltype
Type of the field. Constants are in sqltypes.h.
sqllen
Length of the field data.
sqldata
Pointer to the field data. The pointer is of char *
type,
the data pointed by it is in a binary format. Example:
int intval;

switch (sqldata->sqlvar[i].sqltype)
{
    case SQLINTEGER:
        intval = *(int *)sqldata->sqlvar[i].sqldata;
        break;
    ...
}
sqlind
Pointer to the NULL indicator. If returned by DESCRIBE or FETCH, it is always a valid pointer. If used as input for EXECUTE ... USING sqlda;, a NULL-pointer value means that the value for this field is non-NULL. Otherwise it must be a valid pointer and sqlitype
if (*(int2 *)sqldata->sqlvar[i].sqlind != 0)
    printf("value is NULL\n");
sqlname
Name of the field. 0-terminated string.
sqlformat
Reserved in Informix, value of PQfformat
for the field.
sqlitype
Type of the NULL indicator data. It's always SQLSMINT when returning data from the server.
When the SQLDA
is used for a parameterized query, the data is treated
according to the set type.
sqlilen
Length of the NULL indicator data.
sqlxid
Extended type of the field, result of PQftype
.
sqltypename
sqltypelen
sqlownerlen
sqlsourcetype
sqlownername
sqlsourceid
sqlflags
sqlreserved
Unused.
sqlilongdata
It equals sqldata if sqllen is larger than 32 kB.
Example:
EXEC SQL INCLUDE sqlda.h;

    sqlda_t *sqlda; /* This doesn't need to be under embedded DECLARE SECTION */

    EXEC SQL BEGIN DECLARE SECTION;
    char *prep_stmt = "select * from table1";
    int i;
    EXEC SQL END DECLARE SECTION;

    ...

    EXEC SQL PREPARE mystmt FROM :prep_stmt;

    EXEC SQL DESCRIBE mystmt INTO sqlda;

    printf("# of fields: %d\n", sqlda->sqld);
    for (i = 0; i < sqlda->sqld; i++)
        printf("field %d: \"%s\"\n", i, sqlda->sqlvar[i].sqlname);

    EXEC SQL DECLARE mycursor CURSOR FOR mystmt;
    EXEC SQL OPEN mycursor;
    EXEC SQL WHENEVER NOT FOUND GOTO out;

    while (1)
    {
        EXEC SQL FETCH mycursor USING sqlda;
    }

    EXEC SQL CLOSE mycursor;

    free(sqlda); /* The main structure is all to be free(),
                  * sqlda and sqlda->sqlvar is in one allocated area */
For more information, see the sqlda.h
header and the
src/interfaces/ecpg/test/compat_informix/sqlda.pgc
regression test.
decadd
Add two decimal type values.
int decadd(decimal *arg1, decimal *arg2, decimal *sum);
The function receives a pointer to the first operand of type decimal
(arg1
), a pointer to the second operand of type decimal
(arg2
) and a pointer to a value of type decimal that will
contain the sum (sum
). On success, the function returns 0.
ECPG_INFORMIX_NUM_OVERFLOW
is returned in case of overflow and
ECPG_INFORMIX_NUM_UNDERFLOW
in case of underflow. -1 is returned for
other failures and errno
is set to the respective errno
number of the
pgtypeslib.
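A short sketch of adding two decimal values created from integers and printing the sum; it assumes the Informix-compatibility declarations live in ecpg_informix.h and the decimal type in pgtypes_numeric.h, with error handling reduced to a minimum:

#include <stdio.h>
#include <pgtypes_numeric.h>   /* decimal type */
#include <ecpg_informix.h>     /* deccvint(), decadd(), dectoasc() (assumed header) */

int
main(void)
{
    decimal a, b, sum;
    char    text[64];

    deccvint(7, &a);                /* a = 7  */
    deccvint(35, &b);               /* b = 35 */

    if (decadd(&a, &b, &sum) == 0)  /* sum = a + b */
    {
        /* -1: include all available digits right of the decimal point */
        dectoasc(&sum, text, sizeof(text), -1);
        printf("sum = %s\n", text);
    }
    return 0;
}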
deccmp
Compare two variables of type decimal.
int deccmp(decimal *arg1, decimal *arg2);
The function receives a pointer to the first decimal value
(arg1
), a pointer to the second decimal value
(arg2
) and returns an integer value that indicates which is
the bigger value.
1, if the value that arg1
points to is bigger than the
value that arg2 points to
-1, if the value that arg1
points to is smaller than the
value that arg2
points to
0, if the value that arg1
points to and the value that
arg2
points to are equal
deccopy
Copy a decimal value.
void deccopy(decimal *src, decimal *target);
The function receives a pointer to the decimal value that should be
copied as the first argument (src
) and a pointer to the
target structure of type decimal (target
) as the second
argument.
deccvasc
Convert a value from its ASCII representation into a decimal type.
int deccvasc(char *cp, int len, decimal *np);
The function receives a pointer to string that contains the string
representation of the number to be converted (cp
) as well
as its length len
. np
is a pointer to the
decimal value that saves the result of the operation.
Valid formats are for example:
-2
,
.794
,
+3.44
,
592.49E07
or
-32.84e-4
.
The function returns 0 on success. If overflow or underflow occurred,
ECPG_INFORMIX_NUM_OVERFLOW
or
ECPG_INFORMIX_NUM_UNDERFLOW
is returned. If the ASCII
representation could not be parsed,
ECPG_INFORMIX_BAD_NUMERIC
is returned or
ECPG_INFORMIX_BAD_EXPONENT
if this problem occurred while
parsing the exponent.
deccvdbl
Convert a value of type double to a value of type decimal.
int deccvdbl(double dbl, decimal *np);
The function receives the variable of type double that should be
converted as its first argument (dbl
). As the second
argument (np
), the function receives a pointer to the
decimal variable that should hold the result of the operation.
The function returns 0 on success and a negative value if the conversion failed.
deccvint
Convert a value of type int to a value of type decimal.
int deccvint(int in, decimal *np);
The function receives the variable of type int that should be
converted as its first argument (in
). As the second
argument (np
), the function receives a pointer to the
decimal variable that should hold the result of the operation.
The function returns 0 on success and a negative value if the conversion failed.
deccvlong
Convert a value of type long to a value of type decimal.
int deccvlong(long lng, decimal *np);
The function receives the variable of type long that should be
converted as its first argument (lng
). As the second
argument (np
), the function receives a pointer to the
decimal variable that should hold the result of the operation.
The function returns 0 on success and a negative value if the conversion failed.
decdiv
Divide two variables of type decimal.
int decdiv(decimal *n1, decimal *n2, decimal *result);
The function receives pointers to the variables that are the first
(n1
) and the second (n2
) operands and
calculates n1
/n2
. result
is a
pointer to the variable that should hold the result of the operation.
On success, 0 is returned and a negative value if the division fails.
If overflow or underflow occurred, the function returns
ECPG_INFORMIX_NUM_OVERFLOW
or
ECPG_INFORMIX_NUM_UNDERFLOW
respectively. If an attempt to
divide by zero is observed, the function returns
ECPG_INFORMIX_DIVIDE_ZERO
.
decmul
Multiply two decimal values.
int decmul(decimal *n1, decimal *n2, decimal *result);
The function receives pointers to the variables that are the first
(n1
) and the second (n2
) operands and
calculates n1
*n2
. result
is a
pointer to the variable that should hold the result of the operation.
On success, 0 is returned and a negative value if the multiplication
fails. If overflow or underflow occurred, the function returns
ECPG_INFORMIX_NUM_OVERFLOW
or
ECPG_INFORMIX_NUM_UNDERFLOW
respectively.
decsub
Subtract one decimal value from another.
int decsub(decimal *n1, decimal *n2, decimal *result);
The function receives pointers to the variables that are the first
(n1
) and the second (n2
) operands and
calculates n1
-n2
. result
is a
pointer to the variable that should hold the result of the operation.
On success, 0 is returned and a negative value if the subtraction
fails. If overflow or underflow occurred, the function returns
ECPG_INFORMIX_NUM_OVERFLOW
or
ECPG_INFORMIX_NUM_UNDERFLOW
respectively.
dectoasc
Convert a variable of type decimal to its ASCII representation in a C char* string.
int dectoasc(decimal *np, char *cp, int len, int right)
The function receives a pointer to a variable of type decimal
(np
) that it converts to its textual representation.
cp
is the buffer that should hold the result of the
operation. The parameter right specifies how many digits right of the decimal point should be included in the output. The result
will be rounded to this number of decimal digits. Setting
right
to -1 indicates that all available decimal digits
should be included in the output. If the length of the output buffer, which is indicated by len, is not sufficient to hold the
textual representation including the trailing zero byte, only a
single *
character is stored in the result and -1 is
returned.
The function returns either -1 if the buffer cp
was too
small or ECPG_INFORMIX_OUT_OF_MEMORY
if memory was
exhausted.
dectodbl
Convert a variable of type decimal to a double.
int dectodbl(decimal *np, double *dblp);
The function receives a pointer to the decimal value to convert
(np
) and a pointer to the double variable that
should hold the result of the operation (dblp
).
On success, 0 is returned and a negative value if the conversion failed.
dectoint
Convert a variable of type decimal to an integer.
int dectoint(decimal *np, int *ip);
The function receives a pointer to the decimal value to convert
(np
) and a pointer to the integer variable that
should hold the result of the operation (ip
).
On success, 0 is returned and a negative value if the conversion
failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW
is returned.
Note that the ECPG implementation differs from the Informix
implementation. Informix limits an integer to the range from -32767 to
32767, while the limits in the ECPG implementation depend on the
architecture (INT_MIN .. INT_MAX
).
dectolong
Convert a variable of type decimal to a long integer.
int dectolong(decimal *np, long *lngp);
The function receives a pointer to the decimal value to convert
(np
) and a pointer to the long variable that
should hold the result of the operation (lngp
).
On success, 0 is returned and a negative value if the conversion
failed. If an overflow occurred, ECPG_INFORMIX_NUM_OVERFLOW
is returned.
Note that the ECPG implementation differs from the Informix
implementation. Informix limits a long integer to the range from
-2,147,483,647 to 2,147,483,647, while the limits in the ECPG
implementation depend on the architecture (-LONG_MAX ..
LONG_MAX
).
rdatestr
Converts a date to a C char* string.
int rdatestr(date d, char *str);
The function receives two arguments, the first one is the date to
convert (d
) and the second one is a pointer to the target
string. The output format is always yyyy-mm-dd
, so you need
to allocate at least 11 bytes (including the zero-byte terminator) for the
string.
The function returns 0 on success and a negative value in case of error.
Note that ECPG's implementation differs from the Informix implementation. In Informix the format can be influenced by setting environment variables. In ECPG however, you cannot change the output format.
rstrdate
Parse the textual representation of a date.
int rstrdate(char *str, date *d);
The function receives the textual representation of the date to convert
(str
) and a pointer to a variable of type date
(d
). This function does not allow you to specify a format
mask. It uses the default format mask of Informix which is
mm/dd/yyyy
. Internally, this function is implemented by
means of rdefmtdate
. Therefore, rstrdate
is
not faster and if you have the choice you should opt for
rdefmtdate
which allows you to specify the format mask
explicitly.
The function returns the same values as rdefmtdate
.
rtoday
Get the current date.
void rtoday(date *d);
The function receives a pointer to a date variable (d
)
that it sets to the current date.
Internally this function uses the PGTYPESdate_today
function.
rjulmdy
Extract the values for the day, the month and the year from a variable of type date.
int rjulmdy(date d, short mdy[3]);
The function receives the date d
and a pointer to an array
of 3 short integer values mdy
. The variable name indicates
the sequential order: mdy[0]
will be set to contain the
number of the month, mdy[1]
will be set to the value of the
day and mdy[2]
will contain the year.
The function always returns 0 at the moment.
Internally the function uses the PGTYPESdate_julmdy
function.
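A brief sketch combining rtoday and rjulmdy; the date type comes from pgtypes_date.h, and the Informix-compatibility declarations are assumed to come from ecpg_informix.h:

#include <stdio.h>
#include <pgtypes_date.h>      /* date type */
#include <ecpg_informix.h>     /* rtoday(), rjulmdy() (assumed header) */

int
main(void)
{
    date  today;
    short mdy[3];

    rtoday(&today);             /* current date */
    rjulmdy(today, mdy);        /* split into month, day, year */

    printf("month = %d, day = %d, year = %d\n", mdy[0], mdy[1], mdy[2]);
    return 0;
}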
rdefmtdate
Use a format mask to convert a character string to a value of type date.
int rdefmtdate(date *d, char *fmt, char *str);
The function receives a pointer to the date value that should hold the
result of the operation (d
), the format mask to use for
parsing the date (fmt
) and the C char* string containing
the textual representation of the date (str
). The textual
representation is expected to match the format mask. However you do not
need to have a 1:1 mapping of the string to the format mask. The
function only analyzes the sequential order and looks for the literals
yy
or yyyy
that indicate the
position of the year, mm
to indicate the position of
the month and dd
to indicate the position of the
day.
The function returns the following values:
0 - The function terminated successfully.
ECPG_INFORMIX_ENOSHORTDATE
- The date does not contain
delimiters between day, month and year. In this case the input
string must be exactly 6 or 8 bytes long but isn't.
ECPG_INFORMIX_ENOTDMY
- The format string did not
correctly indicate the sequential order of year, month and day.
ECPG_INFORMIX_BAD_DAY
- The input string does not
contain a valid day.
ECPG_INFORMIX_BAD_MONTH
- The input string does not
contain a valid month.
ECPG_INFORMIX_BAD_YEAR
- The input string does not
contain a valid year.
Internally this function is implemented to use the PGTYPESdate_defmt_asc
function. See the reference there for a
table of example input.
rfmtdate
Convert a variable of type date to its textual representation using a format mask.
int rfmtdate(date d, char *fmt, char *str);
The function receives the date to convert (d
), the format
mask (fmt
) and the string that will hold the textual
representation of the date (str
).
On success, 0 is returned and a negative value if an error occurred.
Internally this function uses the PGTYPESdate_fmt_asc
function, see the reference there for examples.
rmdyjul
Create a date value from an array of 3 short integers that specify the day, the month and the year of the date.
int rmdyjul(short mdy[3], date *d);
The function receives the array of the 3 short integers
(mdy
) and a pointer to a variable of type date that should
hold the result of the operation.
Currently the function always returns 0.
Internally the function is implemented to use the function PGTYPESdate_mdyjul
.
rdayofweek
Return a number representing the day of the week for a date value.
int rdayofweek(date d);
The function receives the date variable d
as its only
argument and returns an integer that indicates the day of the week for
this date.
0 - Sunday
1 - Monday
2 - Tuesday
3 - Wednesday
4 - Thursday
5 - Friday
6 - Saturday
Internally the function is implemented to use the function PGTYPESdate_dayofweek
.
dtcurrent
Retrieve the current timestamp.
void dtcurrent(timestamp *ts);
The function retrieves the current timestamp and saves it into the
timestamp variable that ts
points to.
dtcvasc
Parses a timestamp from its textual representation into a timestamp variable.
int dtcvasc(char *str, timestamp *ts);
The function receives the string to parse (str
) and a
pointer to the timestamp variable that should hold the result of the
operation (ts
).
The function returns 0 on success and a negative value in case of error.
Internally this function uses the PGTYPEStimestamp_from_asc
function. See the reference there
for a table with example inputs.
dtcvfmtasc
Parses a timestamp from its textual representation using a format mask into a timestamp variable.
dtcvfmtasc(char *inbuf, char *fmtstr, timestamp *dtvalue)
The function receives the string to parse (inbuf), the format mask to use (fmtstr) and a pointer to the timestamp variable that should hold the result of the operation (dtvalue).
This function is implemented by means of the PGTYPEStimestamp_defmt_asc function. See the documentation there for a list of format specifiers that can be used.
The function returns 0 on success and a negative value in case of error.
dtsub
Subtract one timestamp from another and return a variable of type interval.
int dtsub(timestamp *ts1, timestamp *ts2, interval *iv);
The function subtracts the timestamp variable that ts2 points to from the timestamp variable that ts1 points to and stores the result in the interval variable that iv points to.
The function returns 0 on success and a negative value if an error occurred.
dttoasc
Convert a timestamp variable to a C char* string.
int dttoasc(timestamp *ts, char *output);
The function receives a pointer to the timestamp variable to convert (ts) and the string that should hold the result of the operation (output). It converts ts to its textual representation according to the SQL standard, which is YYYY-MM-DD HH:MM:SS.
The function returns 0 on success and a negative value if an error occurred.
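As a brief, hedged sketch (again assuming Informix compatibility mode and sufficiently large buffers), the timestamp functions can be combined like this:
timestamp ts1, ts2;
interval  iv;
char      out[64];

dtcurrent(&ts1);                           /* current timestamp */
dtcvasc("2024-05-01 08:30:00", &ts2);      /* parse a textual timestamp */
dtsub(&ts1, &ts2, &iv);                    /* iv = ts1 - ts2 */
dttoasc(&ts1, out);                        /* "YYYY-MM-DD HH:MM:SS" form of ts1 */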
dttofmtasc
Convert a timestamp variable to a C char* using a format mask.
int dttofmtasc(timestamp *ts, char *output, int str_len, char *fmtstr);
The function receives a pointer to the timestamp to convert as its first argument (ts), a pointer to the output buffer (output), the maximal length that has been allocated for the output buffer (str_len) and the format mask to use for the conversion (fmtstr).
The function returns 0 on success and a negative value if an error occurred.
Internally, this function uses the PGTYPEStimestamp_fmt_asc function. See the reference there for information on what format mask specifiers can be used.
intoasc
Convert an interval variable to a C char* string.
int intoasc(interval *i, char *str);
The function receives a pointer to the interval variable to convert (i) and the string that should hold the result of the operation (str). It converts i to its textual representation.
The function returns 0 on success and a negative value if an error occurred.
rfmtlong
Convert a long integer value to its textual representation using a format mask.
int rfmtlong(long lng_val, char *fmt, char *outbuf);
The function receives the long value lng_val, the format mask fmt and a pointer to the output buffer outbuf. It converts the long value according to the format mask to its textual representation.
The format mask can be composed of the following format specifying characters:
* (asterisk) - if this position would be blank otherwise, fill it with an asterisk.
& (ampersand) - if this position would be blank otherwise, fill it with a zero.
# - turn leading zeroes into blanks.
< - left-justify the number in the string.
, (comma) - group numbers of four or more digits into groups of three digits separated by a comma.
. (period) - this character separates the whole-number part of the number from the fractional part.
- (minus) - the minus sign appears if the number is a negative value.
+ (plus) - the plus sign appears if the number is a positive value.
( - this replaces the minus sign in front of the negative number. The minus sign will not appear.
) - this character replaces the minus and is printed behind the negative value.
$ - the currency symbol.
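As a rough sketch only (the exact output depends on how the implementation interprets the mask; see the PGTYPESnum formatting reference for authoritative examples), a call could look like this:
char obuf[32];

/* group the digits with commas and reserve a position for a minus sign */
rfmtlong(1234567L, "-##,###,###", obuf);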
rupshift
Convert a string to upper case.
void rupshift(char *str);
The function receives a pointer to the string and transforms every lower case character to upper case.
byleng
Return the number of characters in a string without counting trailing blanks.
int byleng(char *str, int len);
The function expects a fixed-length string as its first argument (str) and its length as its second argument (len). It returns the number of significant characters, that is the length of the string without trailing blanks.
ldchar
Copy a fixed-length string into a null-terminated string.
void ldchar(char *src, int len, char *dest);
The function receives the fixed-length string to copy (src), its length (len) and a pointer to the destination memory (dest). Note that you need to reserve at least len+1 bytes for the string that dest points to. The function copies at most len bytes to the new location (less if the source string has trailing blanks) and adds the null terminator.
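A short sketch of how these two functions are typically used together (values are illustrative only):
char fixed[] = "abc   ";   /* blank-padded, fixed-length content of 6 characters */
char dest[7];              /* len+1 bytes for the null-terminated copy */
int  n;

n = byleng(fixed, 6);      /* n is 3: trailing blanks are not counted */
ldchar(fixed, 6, dest);    /* dest now holds "abc" plus a terminating zero byte */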
rgetmsg
int rgetmsg(int msgnum, char *s, int maxsize);
This function exists but is not implemented at the moment!
rtypalign
int rtypalign(int offset, int type);
This function exists but is not implemented at the moment!
rtypmsize
int rtypmsize(int type, int len);
This function exists but is not implemented at the moment!
rtypwidth
int rtypwidth(int sqltype, int sqllen);
This function exists but is not implemented at the moment!
rsetnull
Set a variable to NULL.
int rsetnull(int t, char *ptr);
The function receives an integer that indicates the type of the variable and a pointer to the variable itself that is cast to a C char* pointer.
The following types exist:
CCHARTYPE - For a variable of type char or char*
CSHORTTYPE - For a variable of type short int
CINTTYPE - For a variable of type int
CBOOLTYPE - For a variable of type boolean
CFLOATTYPE - For a variable of type float
CLONGTYPE - For a variable of type long
CDOUBLETYPE - For a variable of type double
CDECIMALTYPE - For a variable of type decimal
CDATETYPE - For a variable of type date
CDTIMETYPE - For a variable of type timestamp
Here is an example of a call to this function:
$char c[] = "abc "; $short s = 17; $int i = -74874; rsetnull(CCHARTYPE, (char *) c); rsetnull(CSHORTTYPE, (char *) &s); rsetnull(CINTTYPE, (char *) &i);
risnull
Test if a variable is NULL.
int risnull(int t, char *ptr);
The function receives the type of the variable to test (t) as well as a pointer to this variable (ptr). Note that the latter needs to be cast to a char*. See the function rsetnull for a list of possible variable types.
Here is an example of how to use this function:
$char c[] = "abc "; $short s = 17; $int i = -74874; risnull(CCHARTYPE, (char *) c); risnull(CSHORTTYPE, (char *) &s); risnull(CINTTYPE, (char *) &i);
Note that all constants here describe errors and all of them are defined to represent negative values. In the descriptions of the different constants you can also find the value that each constant represents in the current implementation. However, you should not rely on this number; you can only rely on the fact that all of them are defined to represent negative values.
ECPG_INFORMIX_NUM_OVERFLOW
Functions return this value if an overflow occurred in a calculation. Internally it is defined as -1200 (the Informix definition).
ECPG_INFORMIX_NUM_UNDERFLOW
Functions return this value if an underflow occurred in a calculation. Internally it is defined as -1201 (the Informix definition).
ECPG_INFORMIX_DIVIDE_ZERO
Functions return this value if an attempt to divide by zero is observed. Internally it is defined as -1202 (the Informix definition).
ECPG_INFORMIX_BAD_YEAR
Functions return this value if a bad value for a year was found while parsing a date. Internally it is defined as -1204 (the Informix definition).
ECPG_INFORMIX_BAD_MONTH
Functions return this value if a bad value for a month was found while parsing a date. Internally it is defined as -1205 (the Informix definition).
ECPG_INFORMIX_BAD_DAY
Functions return this value if a bad value for a day was found while parsing a date. Internally it is defined as -1206 (the Informix definition).
ECPG_INFORMIX_ENOSHORTDATE
Functions return this value if a parsing routine needs a short date representation but was not passed a date string of the right length. Internally it is defined as -1209 (the Informix definition).
ECPG_INFORMIX_DATE_CONVERT
Functions return this value if an error occurred during date formatting. Internally it is defined as -1210 (the Informix definition).
ECPG_INFORMIX_OUT_OF_MEMORY
Functions return this value if memory was exhausted during their operation. Internally it is defined as -1211 (the Informix definition).
ECPG_INFORMIX_ENOTDMY
Functions return this value if a parsing routine was supposed to get a format mask (like mmddyy) but not all fields were listed correctly. Internally it is defined as -1212 (the Informix definition).
ECPG_INFORMIX_BAD_NUMERIC
Functions return this value either if a parsing routine cannot parse the textual representation for a numeric value because it contains errors or if a routine cannot complete a calculation involving numeric variables because at least one of the numeric variables is invalid. Internally it is defined as -1213 (the Informix definition).
ECPG_INFORMIX_BAD_EXPONENT
Functions return this value if a parsing routine cannot parse an exponent. Internally it is defined as -1216 (the Informix definition).
ECPG_INFORMIX_BAD_DATE
Functions return this value if a parsing routine cannot parse a date. Internally it is defined as -1218 (the Informix definition).
ECPG_INFORMIX_EXTRA_CHARS
Functions return this value if a parsing routine is passed extra characters it cannot parse. Internally it is defined as -1264 (the Informix definition).
ecpg can be run in a so-called Oracle compatibility mode. If this mode is active, it tries to behave as if it were Oracle Pro*C.
Specifically, this mode changes ecpg in three ways:
Pad character arrays receiving character string types with trailing spaces to the specified length
Zero byte terminate these character arrays, and set the indicator variable if truncation occurs
Set the null indicator to -1 when character arrays receive empty character string types
This section explains how ECPG works internally. This information can occasionally be useful to help users understand how to use ECPG.
The first four lines written by ecpg to the output are fixed lines. Two are comments and two are include lines necessary to interface to the library. Then the preprocessor reads through the file and writes output. Normally it just echoes everything to the output.
When it sees an EXEC SQL statement, it intervenes and changes it. The command starts with EXEC SQL and ends with ;. Everything in between is treated as an SQL statement and parsed for variable substitution.
Variable substitution occurs when a symbol starts with a colon (:). The variable with that name is looked up among the variables that were previously declared within an EXEC SQL DECLARE section.
The most important function in the library is ECPGdo, which takes care of executing most commands. It takes a variable number of arguments. This can easily add up to 50 or so arguments, and we hope this will not be a problem on any platform.
The arguments are:
This is the line number of the original line; used in error messages only.
This is the SQL command that is to be issued. It is modified by the input variables, i.e., the variables that were not known at compile time but are to be entered in the command. Where the variables should go, the string contains ?.
Every input variable causes ten arguments to be created. (See below.)
ECPGt_EOIT
An enum telling that there are no more input variables.
Every output variable causes ten arguments to be created. (See below.) These variables are filled by the function.
ECPGt_EORT
An enum telling that there are no more variables.
For every variable that is part of the SQL command, the function gets ten arguments:
The type as a special symbol.
A pointer to the value or a pointer to the pointer.
The size of the variable if it is a char or varchar.
The number of elements in the array (for array fetches).
The offset to the next element in the array (for array fetches).
The type of the indicator variable as a special symbol.
A pointer to the indicator variable.
0
The number of elements in the indicator array (for array fetches).
The offset to the next element in the indicator array (for array fetches).
Note that not all SQL commands are treated in this way. For instance, an open cursor statement like:
EXEC SQL OPEN cursor;
is not copied to the output. Instead, the cursor's DECLARE command is used at the position of the OPEN command because it indeed opens the cursor.
Here is a complete example describing the output of the preprocessor of a file foo.pgc (details might change with each particular version of the preprocessor):
EXEC SQL BEGIN DECLARE SECTION;
int index;
int result;
EXEC SQL END DECLARE SECTION;
...
EXEC SQL SELECT res INTO :result FROM mytable WHERE index = :index;
is translated into:
/* Processed by ecpg (2.6.0) */
/* These two include files are added by the preprocessor */
#include <ecpgtype.h>;
#include <ecpglib.h>;

/* exec sql begin declare section */
#line 1 "foo.pgc"
int index;
int result;
/* exec sql end declare section */
...
ECPGdo(__LINE__, NULL, "SELECT res FROM mytable WHERE index = ? ",
        ECPGt_int,&(index),1L,1L,sizeof(int),
        ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT,
        ECPGt_int,&(result),1L,1L,sizeof(int),
        ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EORT);
#line 147 "foo.pgc"
(The indentation here is added for readability and not something the preprocessor does.)
Table of Contents
information_schema_catalog_name
administrable_role_authorizations
applicable_roles
attributes
character_sets
check_constraint_routine_usage
check_constraints
collations
collation_character_set_applicability
column_column_usage
column_domain_usage
column_options
column_privileges
column_udt_usage
columns
constraint_column_usage
constraint_table_usage
data_type_privileges
domain_constraints
domain_udt_usage
domains
element_types
enabled_roles
foreign_data_wrapper_options
foreign_data_wrappers
foreign_server_options
foreign_servers
foreign_table_options
foreign_tables
key_column_usage
parameters
referential_constraints
role_column_grants
role_routine_grants
role_table_grants
role_udt_grants
role_usage_grants
routine_column_usage
routine_privileges
routine_routine_usage
routine_sequence_usage
routine_table_usage
routines
schemata
sequences
sql_features
sql_implementation_info
sql_parts
sql_sizing
table_constraints
table_privileges
tables
transforms
triggered_update_columns
triggers
udt_privileges
usage_privileges
user_defined_types
user_mapping_options
user_mappings
view_column_usage
view_routine_usage
view_table_usage
views
The information schema consists of a set of views that contain information about the objects defined in the current database. The information schema is defined in the SQL standard and can therefore be expected to be portable and remain stable — unlike the system catalogs, which are specific to PostgreSQL and are modeled after implementation concerns. The information schema views do not, however, contain information about PostgreSQL-specific features; to inquire about those you need to query the system catalogs or other PostgreSQL-specific views.
When querying the database for constraint information, it is possible for a standard-compliant query that expects to return one row to return several. This is because the SQL standard requires constraint names to be unique within a schema, but PostgreSQL does not enforce this restriction. Constraint names that PostgreSQL generates automatically avoid duplicates in the same schema, but users can specify such duplicate names.
This problem can appear when querying information schema views such as check_constraint_routine_usage, check_constraints, domain_constraints, and referential_constraints. Some other views have similar issues but contain the table name to help distinguish duplicate rows, e.g., constraint_column_usage, constraint_table_usage, table_constraints.
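For example, a query like the following, which a standard-compliant application might expect to return at most one row, can return several rows when duplicate constraint names exist (the schema and constraint name here are hypothetical):
SELECT constraint_schema, constraint_name, check_clause
    FROM information_schema.check_constraints
    WHERE constraint_schema = 'public'
      AND constraint_name = 'mytable_check';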
The information schema itself is a schema named information_schema. This schema automatically exists in all databases. The owner of this schema is the initial database user in the cluster, and that user naturally has all the privileges on this schema, including the ability to drop it (but the space savings achieved by that are minuscule).
By default, the information schema is not in the schema search path, so you need to access all objects in it through qualified names. Since the names of some of the objects in the information schema are generic names that might occur in user applications, you should be careful if you want to put the information schema in the path.
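For example, with the information schema not in the search path, a query has to qualify the view names, as in:
SELECT table_name
    FROM information_schema.tables
    WHERE table_schema = 'public'
    ORDER BY table_name;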
The columns of the information schema views use special data types that are defined in the information schema. These are defined as simple domains over ordinary built-in types. You should not use these types for work outside the information schema, but your applications must be prepared for them if they select from the information schema.
These types are:
cardinal_number
A nonnegative integer.
character_data
A character string (without specific maximum length).
sql_identifier
A character string. This type is used for SQL identifiers, the
type character_data
is used for any other kind of
text data.
time_stamp
A domain over the type timestamp with time zone
yes_or_no
A character string domain that contains either YES or NO. This is used to represent Boolean (true/false) data in the information schema. (The information schema was invented before the type boolean was added to the SQL standard, so this convention is necessary to keep the information schema backward compatible.)
Every column in the information schema has one of these five types.
information_schema_catalog_name
information_schema_catalog_name
is a table that
always contains one row and one column containing the name of the
current database (current catalog, in SQL terminology).
Table 37.1. information_schema_catalog_name
Columns
Column Type Description |
---|
Name of the database that contains this information schema |
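For example, the name of the current database can be retrieved with:
SELECT * FROM information_schema.information_schema_catalog_name;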
administrable_role_authorizations
The view administrable_role_authorizations
identifies all roles that the current user has the admin option
for.
Table 37.2. administrable_role_authorizations
Columns
Column Type Description |
---|
Name of the role to which this role membership was granted (can be the current user, or a different role in case of nested role memberships) |
Name of a role |
Always |
applicable_roles
The view applicable_roles
identifies all roles
whose privileges the current user can use. This means there is
some chain of role grants from the current user to the role in
question. The current user itself is also an applicable role. The
set of applicable roles is generally used for permission checking.
Table 37.3. applicable_roles
Columns
Column Type Description |
---|
Name of the role to which this role membership was granted (can be the current user, or a different role in case of nested role memberships) |
Name of a role |
|
attributes
The view attributes
contains information about
the attributes of composite data types defined in the database.
(Note that the view does not give information about table columns,
which are sometimes called attributes in PostgreSQL contexts.)
Only those attributes are shown that the current user has access to (by way
of being the owner of or having some privilege on the type).
Table 37.4. attributes
Columns
Column Type Description |
---|
Name of the database containing the data type (always the current database) |
Name of the schema containing the data type |
Name of the data type |
Name of the attribute |
Ordinal position of the attribute within the data type (count starts at 1) |
Default expression of the attribute |
|
Data type of the attribute, if it is a built-in type, or
|
If |
If |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Name of the database containing the collation of the attribute (always the current database), null if default or the data type of the attribute is not collatable |
Name of the schema containing the collation of the attribute, null if default or the data type of the attribute is not collatable |
Name of the collation of the attribute, null if default or the data type of the attribute is not collatable |
If |
If |
If |
If |
If |
Applies to a feature not available
in PostgreSQL
(see |
Name of the database that the attribute data type is defined in (always the current database) |
Name of the schema that the attribute data type is defined in |
Name of the attribute data type |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Always null, because arrays always have unlimited maximum cardinality in PostgreSQL |
An identifier of the data type descriptor of the attribute, unique among the data type descriptors pertaining to the composite type. This is mainly useful for joining with other instances of such identifiers. (The specific format of the identifier is not defined and not guaranteed to remain the same in future versions.) |
Applies to a feature not available in PostgreSQL |
See also under Section 37.17, a similarly structured view, for further information on some of the columns.
character_sets
The view character_sets
identifies the character
sets available in the current database. Since PostgreSQL does not
support multiple character sets within one database, this view only
shows one, which is the database encoding.
Take note of how the following terms are used in the SQL standard:
Character repertoire
An abstract collection of characters, for example UNICODE, UCS, or LATIN1. Not exposed as an SQL object, but visible in this view.
Character encoding form
An encoding of some character repertoire. Most older character repertoires only use one encoding form, and so there are no separate names for them (e.g., LATIN2 is an encoding form applicable to the LATIN2 repertoire). But for example Unicode has the encoding forms UTF8, UTF16, etc. (not all supported by PostgreSQL). Encoding forms are not exposed as an SQL object, but are visible in this view.
Character set
A named SQL object that identifies a character repertoire, a character encoding, and a default collation. A predefined character set would typically have the same name as an encoding form, but users could define other names. For example, the character set UTF8 would typically identify the character repertoire UCS, encoding form UTF8, and some default collation.
You can think of an “encoding” in PostgreSQL either as a character set or a character encoding form. They will have the same name, and there can only be one in one database.
Table 37.5. character_sets
Columns
Column Type Description |
---|
Character sets are currently not implemented as schema objects, so this column is null. |
Character sets are currently not implemented as schema objects, so this column is null. |
Name of the character set, currently implemented as showing the name of the database encoding |
Character repertoire, showing |
Character encoding form, same as the database encoding |
Name of the database containing the default collation (always the current database, if any collation is identified) |
Name of the schema containing the default collation |
Name of the default collation. The default collation is
identified as the collation that matches
the |
check_constraint_routine_usage
The view check_constraint_routine_usage
identifies routines (functions and procedures) that are used by a
check constraint. Only those routines are shown that are owned by
a currently enabled role.
Table 37.6. check_constraint_routine_usage
Columns
Column Type Description |
---|
Name of the database containing the constraint (always the current database) |
Name of the schema containing the constraint |
Name of the constraint |
Name of the database containing the function (always the current database) |
Name of the schema containing the function |
The “specific name” of the function. See Section 37.45 for more information. |
check_constraints
The view check_constraints
contains all check
constraints, either defined on a table or on a domain, that are
owned by a currently enabled role. (The owner of the table or
domain is the owner of the constraint.)
Table 37.7. check_constraints
Columns
Column Type Description |
---|
Name of the database containing the constraint (always the current database) |
Name of the schema containing the constraint |
Name of the constraint |
The check expression of the check constraint |
collations
The view collations
contains the collations
available in the current database.
Table 37.8. collations
Columns
Column Type Description |
---|
Name of the database containing the collation (always the current database) |
Name of the schema containing the collation |
Name of the default collation |
Always |
collation_character_set_applicability
The view collation_character_set_applicability
identifies which character set the available collations are
applicable to. In PostgreSQL, there is only one character set per
database (see explanation
in Section 37.7), so this view does
not provide much useful information.
Table 37.9. collation_character_set_applicability
Columns
Column Type Description |
---|
Name of the database containing the collation (always the current database) |
Name of the schema containing the collation |
Name of the default collation |
Character sets are currently not implemented as schema objects, so this column is null |
Character sets are currently not implemented as schema objects, so this column is null |
Name of the character set |
column_column_usage
The view column_column_usage
identifies all generated
columns that depend on another base column in the same table. Only tables
owned by a currently enabled role are included.
Table 37.10. column_column_usage
Columns
Column Type Description |
---|
Name of the database containing the table (always the current database) |
Name of the schema containing the table |
Name of the table |
Name of the base column that a generated column depends on |
Name of the generated column |
column_domain_usage
The view column_domain_usage
identifies all
columns (of a table or a view) that make use of some domain defined
in the current database and owned by a currently enabled role.
Table 37.11. column_domain_usage
Columns
Column Type Description |
---|
Name of the database containing the domain (always the current database) |
Name of the schema containing the domain |
Name of the domain |
Name of the database containing the table (always the current database) |
Name of the schema containing the table |
Name of the table |
Name of the column |
column_options
The view column_options
contains all the
options defined for foreign table columns in the current database. Only
those foreign table columns are shown that the current user has access to
(by way of being the owner or having some privilege).
Table 37.12. column_options
Columns
Column Type Description |
---|
Name of the database that contains the foreign table (always the current database) |
Name of the schema that contains the foreign table |
Name of the foreign table |
Name of the column |
Name of an option |
Value of the option |
column_privileges
The view column_privileges
identifies all
privileges granted on columns to a currently enabled role or by a
currently enabled role. There is one row for each combination of
column, grantor, and grantee.
If a privilege has been granted on an entire table, it will show up in this view as a grant for each column, but only for the privilege types where column granularity is possible: SELECT, INSERT, UPDATE, REFERENCES.
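For example, the column privileges on one table (the table name is hypothetical) could be listed like this:
SELECT grantee, column_name, privilege_type, is_grantable
    FROM information_schema.column_privileges
    WHERE table_schema = 'public'
      AND table_name = 'mytable'
    ORDER BY column_name, grantee;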
Table 37.13. column_privileges
Columns
Column Type Description |
---|
Name of the role that granted the privilege |
Name of the role that the privilege was granted to |
Name of the database that contains the table that contains the column (always the current database) |
Name of the schema that contains the table that contains the column |
Name of the table that contains the column |
Name of the column |
Type of the privilege: |
|
column_udt_usage
The view column_udt_usage
identifies all columns
that use data types owned by a currently enabled role. Note that in
PostgreSQL, built-in data types behave
like user-defined types, so they are included here as well. See
also Section 37.17 for details.
Table 37.14. column_udt_usage
Columns
Column Type Description |
---|
Name of the database that the column data type (the underlying type of the domain, if applicable) is defined in (always the current database) |
Name of the schema that the column data type (the underlying type of the domain, if applicable) is defined in |
Name of the column data type (the underlying type of the domain, if applicable) |
Name of the database containing the table (always the current database) |
Name of the schema containing the table |
Name of the table |
Name of the column |
columns
The view columns contains information about all table columns (or view columns) in the database. System columns (ctid, etc.) are not included. Only those columns are shown that the current user has access to (by way of being the owner or having some privilege).
Table 37.15. columns
Columns
Column Type Description |
---|
Name of the database containing the table (always the current database) |
Name of the schema containing the table |
Name of the table |
Name of the column |
Ordinal position of the column within the table (count starts at 1) |
Default expression of the column |
|
Data type of the column, if it is a built-in type, or
|
If |
If |
If |
If |
If |
If |
If |
Applies to a feature not available
in PostgreSQL
(see |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Name of the database containing the collation of the column (always the current database), null if default or the data type of the column is not collatable |
Name of the schema containing the collation of the column, null if default or the data type of the column is not collatable |
Name of the collation of the column, null if default or the data type of the column is not collatable |
If the column has a domain type, the name of the database that the domain is defined in (always the current database), else null. |
If the column has a domain type, the name of the schema that the domain is defined in, else null. |
If the column has a domain type, the name of the domain, else null. |
Name of the database that the column data type (the underlying type of the domain, if applicable) is defined in (always the current database) |
Name of the schema that the column data type (the underlying type of the domain, if applicable) is defined in |
Name of the column data type (the underlying type of the domain, if applicable) |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Always null, because arrays always have unlimited maximum cardinality in PostgreSQL |
An identifier of the data type descriptor of the column, unique among the data type descriptors pertaining to the table. This is mainly useful for joining with other instances of such identifiers. (The specific format of the identifier is not defined and not guaranteed to remain the same in future versions.) |
Applies to a feature not available in PostgreSQL |
If the column is an identity column, then |
If the column is an identity column, then |
If the column is an identity column, then the start value of the internal sequence, else null. |
If the column is an identity column, then the increment of the internal sequence, else null. |
If the column is an identity column, then the maximum value of the internal sequence, else null. |
If the column is an identity column, then the minimum value of the internal sequence, else null. |
If the column is an identity column, then |
If the column is a generated column, then |
If the column is a generated column, then the generation expression, else null. |
|
Since data types can be defined in a variety of ways in SQL, and PostgreSQL contains additional ways to define data types, their representation in the information schema can be somewhat difficult. The column data_type is supposed to identify the underlying built-in type of the column. In PostgreSQL, this means that the type is defined in the system catalog schema pg_catalog. This column might be useful if the application can handle the well-known built-in types specially (for example, format the numeric types differently or use the data in the precision columns). The columns udt_name, udt_schema, and udt_catalog always identify the underlying data type of the column, even if the column is based on a domain. (Since PostgreSQL treats built-in types like user-defined types, built-in types appear here as well. This is an extension of the SQL standard.) These columns should be used if an application wants to process data differently according to the type, because in that case it wouldn't matter if the column is really based on a domain. If the column is based on a domain, the identity of the domain is stored in the columns domain_name, domain_schema, and domain_catalog. If you want to pair up columns with their associated data types and treat domains as separate types, you could write coalesce(domain_name, udt_name), etc.
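For example, following the coalesce(domain_name, udt_name) suggestion above, a query against a hypothetical table could look like this:
SELECT column_name,
       data_type,
       coalesce(domain_name, udt_name) AS effective_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
      AND table_name = 'mytable'
    ORDER BY ordinal_position;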
constraint_column_usage
The view constraint_column_usage
identifies all
columns in the current database that are used by some constraint.
Only those columns are shown that are contained in a table owned by
a currently enabled role. For a check constraint, this view
identifies the columns that are used in the check expression. For
a foreign key constraint, this view identifies the columns that the
foreign key references. For a unique or primary key constraint,
this view identifies the constrained columns.
Table 37.16. constraint_column_usage
Columns
Column Type Description |
---|
Name of the database that contains the table that contains the column that is used by some constraint (always the current database) |
Name of the schema that contains the table that contains the column that is used by some constraint |
Name of the table that contains the column that is used by some constraint |
Name of the column that is used by some constraint |
Name of the database that contains the constraint (always the current database) |
Name of the schema that contains the constraint |
Name of the constraint |
constraint_table_usage
The view constraint_table_usage identifies all tables in the current database that are used by some constraint and are owned by a currently enabled role. (This is different from the view table_constraints, which identifies all table constraints along with the table they are defined on.) For a foreign key constraint, this view identifies the table that the foreign key references. For a unique or primary key constraint, this view simply identifies the table the constraint belongs to. Check constraints and not-null constraints are not included in this view.
Table 37.17. constraint_table_usage
Columns
Column Type Description |
---|
Name of the database that contains the table that is used by some constraint (always the current database) |
Name of the schema that contains the table that is used by some constraint |
Name of the table that is used by some constraint |
Name of the database that contains the constraint (always the current database) |
Name of the schema that contains the constraint |
Name of the constraint |
data_type_privileges
The view data_type_privileges
identifies all
data type descriptors that the current user has access to, by way
of being the owner of the described object or having some privilege
for it. A data type descriptor is generated whenever a data type
is used in the definition of a table column, a domain, or a
function (as parameter or return type) and stores some information
about how the data type is used in that instance (for example, the
declared maximum length, if applicable). Each data type
descriptor is assigned an arbitrary identifier that is unique
among the data type descriptor identifiers assigned for one object
(table, domain, function). This view is probably not useful for
applications, but it is used to define some other views in the
information schema.
Table 37.18. data_type_privileges
Columns
Column Type Description |
---|
Name of the database that contains the described object (always the current database) |
Name of the schema that contains the described object |
Name of the described object |
The type of the described object: one of
|
The identifier of the data type descriptor, which is unique among the data type descriptors for that same object. |
domain_constraints
The view domain_constraints
contains all constraints
belonging to domains defined in the current database. Only those domains
are shown that the current user has access to (by way of being the owner or
having some privilege).
Table 37.19. domain_constraints
Columns
Column Type Description |
---|
Name of the database that contains the constraint (always the current database) |
Name of the schema that contains the constraint |
Name of the constraint |
Name of the database that contains the domain (always the current database) |
Name of the schema that contains the domain |
Name of the domain |
|
|
domain_udt_usage
The view domain_udt_usage
identifies all domains
that are based on data types owned by a currently enabled role.
Note that in PostgreSQL, built-in data
types behave like user-defined types, so they are included here as
well.
Table 37.20. domain_udt_usage
Columns
Column Type Description |
---|
Name of the database that the domain data type is defined in (always the current database) |
Name of the schema that the domain data type is defined in |
Name of the domain data type |
Name of the database that contains the domain (always the current database) |
Name of the schema that contains the domain |
Name of the domain |
domains
The view domains
contains all domains defined in the
current database. Only those domains are shown that the current user has
access to (by way of being the owner or having some privilege).
Table 37.21. domains
Columns
Column Type Description |
---|
Name of the database that contains the domain (always the current database) |
Name of the schema that contains the domain |
Name of the domain |
Data type of the domain, if it is a built-in type, or
|
If the domain has a character or bit string type, the declared maximum length; null for all other data types or if no maximum length was declared. |
If the domain has a character type, the maximum possible length in octets (bytes) of a datum; null for all other data types. The maximum octet length depends on the declared character maximum length (see above) and the server encoding. |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Name of the database containing the collation of the domain (always the current database), null if default or the data type of the domain is not collatable |
Name of the schema containing the collation of the domain, null if default or the data type of the domain is not collatable |
Name of the collation of the domain, null if default or the data type of the domain is not collatable |
If the domain has a numeric type, this column contains the
(declared or implicit) precision of the type for this domain.
The precision indicates the number of significant digits. It
can be expressed in decimal (base 10) or binary (base 2) terms,
as specified in the column
|
If the domain has a numeric type, this column indicates in
which base the values in the columns
|
If the domain has an exact numeric type, this column contains
the (declared or implicit) scale of the type for this domain.
The scale indicates the number of significant digits to the
right of the decimal point. It can be expressed in decimal
(base 10) or binary (base 2) terms, as specified in the column
|
If |
If |
Applies to a feature not available
in PostgreSQL
(see |
Default expression of the domain |
Name of the database that the domain data type is defined in (always the current database) |
Name of the schema that the domain data type is defined in |
Name of the domain data type |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Always null, because arrays always have unlimited maximum cardinality in PostgreSQL |
An identifier of the data type descriptor of the domain, unique among the data type descriptors pertaining to the domain (which is trivial, because a domain only contains one data type descriptor). This is mainly useful for joining with other instances of such identifiers. (The specific format of the identifier is not defined and not guaranteed to remain the same in future versions.) |
element_types
The view element_types contains the data type descriptors of the elements of arrays. When a table column, composite-type attribute, domain, function parameter, or function return value is defined to be of an array type, the respective information schema view only contains ARRAY in the column data_type. To obtain information on the element type of the array, you can join the respective view with this view. For example, to show the columns of a table with data types and array element types, if applicable, you could do:
SELECT c.column_name, c.data_type, e.data_type AS element_type
    FROM information_schema.columns c
        LEFT JOIN information_schema.element_types e
            ON ((c.table_catalog, c.table_schema, c.table_name, 'TABLE', c.dtd_identifier)
              = (e.object_catalog, e.object_schema, e.object_name, e.object_type, e.collection_type_identifier))
    WHERE c.table_schema = '...' AND c.table_name = '...'
    ORDER BY c.ordinal_position;
This view only includes objects that the current user has access to, by way of being the owner or having some privilege.
Table 37.22. element_types
Columns
Column Type Description |
---|
Name of the database that contains the object that uses the array being described (always the current database) |
Name of the schema that contains the object that uses the array being described |
Name of the object that uses the array being described |
The type of the object that uses the array being described: one
of |
The identifier of the data type descriptor of the array being
described. Use this to join with the
|
Data type of the array elements, if it is a built-in type, else
|
Always null, since this information is not applied to array element data types in PostgreSQL |
Always null, since this information is not applied to array element data types in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Name of the database containing the collation of the element type (always the current database), null if default or the data type of the element is not collatable |
Name of the schema containing the collation of the element type, null if default or the data type of the element is not collatable |
Name of the collation of the element type, null if default or the data type of the element is not collatable |
Always null, since this information is not applied to array element data types in PostgreSQL |
Always null, since this information is not applied to array element data types in PostgreSQL |
Always null, since this information is not applied to array element data types in PostgreSQL |
Always null, since this information is not applied to array element data types in PostgreSQL |
Always null, since this information is not applied to array element data types in PostgreSQL |
Always null, since this information is not applied to array element data types in PostgreSQL |
Not yet implemented |
Name of the database that the data type of the elements is defined in (always the current database) |
Name of the schema that the data type of the elements is defined in |
Name of the data type of the elements |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Always null, because arrays always have unlimited maximum cardinality in PostgreSQL |
An identifier of the data type descriptor of the element. This is currently not useful. |
enabled_roles
The view enabled_roles identifies the currently “enabled roles”. The enabled roles are recursively defined as the current user together with all roles that have been granted to the enabled roles with automatic inheritance. In other words, these are all roles that the current user has direct or indirect, automatically inheriting membership in.
For permission checking, the set of “applicable roles” is applied, which can be broader than the set of enabled roles. So generally, it is better to use the view applicable_roles instead of this one; see Section 37.5 for details on the applicable_roles view.
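For example, the roles whose privileges the current user can use but that are not automatically inherited (and would therefore have to be assumed with SET ROLE) could be listed with a query such as:
SELECT role_name FROM information_schema.applicable_roles
EXCEPT
SELECT role_name FROM information_schema.enabled_roles;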
Table 37.23. enabled_roles
Columns
Column Type Description |
---|
Name of a role |
foreign_data_wrapper_options
The view foreign_data_wrapper_options
contains
all the options defined for foreign-data wrappers in the current
database. Only those foreign-data wrappers are shown that the
current user has access to (by way of being the owner or having
some privilege).
Table 37.24. foreign_data_wrapper_options
Columns
Column Type Description |
---|
Name of the database that the foreign-data wrapper is defined in (always the current database) |
Name of the foreign-data wrapper |
Name of an option |
Value of the option |
foreign_data_wrappers
The view foreign_data_wrappers
contains all
foreign-data wrappers defined in the current database. Only those
foreign-data wrappers are shown that the current user has access to
(by way of being the owner or having some privilege).
Table 37.25. foreign_data_wrappers
Columns
Column Type Description |
---|
Name of the database that contains the foreign-data wrapper (always the current database) |
Name of the foreign-data wrapper |
Name of the owner of the foreign server |
File name of the library that implements this foreign-data wrapper |
Language used to implement this foreign-data wrapper |
foreign_server_options
The view foreign_server_options
contains all the
options defined for foreign servers in the current database. Only
those foreign servers are shown that the current user has access to
(by way of being the owner or having some privilege).
Table 37.26. foreign_server_options
Columns
Column Type Description |
---|
Name of the database that the foreign server is defined in (always the current database) |
Name of the foreign server |
Name of an option |
Value of the option |
foreign_servers
The view foreign_servers
contains all foreign
servers defined in the current database. Only those foreign
servers are shown that the current user has access to (by way of
being the owner or having some privilege).
Table 37.27. foreign_servers
Columns
Column Type Description |
---|
Name of the database that the foreign server is defined in (always the current database) |
Name of the foreign server |
Name of the database that contains the foreign-data wrapper used by the foreign server (always the current database) |
Name of the foreign-data wrapper used by the foreign server |
Foreign server type information, if specified upon creation |
Foreign server version information, if specified upon creation |
Name of the owner of the foreign server |
foreign_table_options
The view foreign_table_options
contains all the
options defined for foreign tables in the current database. Only
those foreign tables are shown that the current user has access to
(by way of being the owner or having some privilege).
Table 37.28. foreign_table_options
Columns
Column Type Description |
---|
Name of the database that contains the foreign table (always the current database) |
Name of the schema that contains the foreign table |
Name of the foreign table |
Name of an option |
Value of the option |
foreign_tables
The view foreign_tables
contains all foreign
tables defined in the current database. Only those foreign
tables are shown that the current user has access to (by way of
being the owner or having some privilege).
Table 37.29. foreign_tables
Columns
Column Type Description |
---|
Name of the database that the foreign table is defined in (always the current database) |
Name of the schema that contains the foreign table |
Name of the foreign table |
Name of the database that the foreign server is defined in (always the current database) |
Name of the foreign server |
key_column_usage
The view key_column_usage
identifies all columns
in the current database that are restricted by some unique, primary
key, or foreign key constraint. Check constraints are not included
in this view. Only those columns are shown that the current user
has access to, by way of being the owner or having some privilege.
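For example, the columns of a table's primary key can be listed in key order by joining this view with table_constraints (the table name is hypothetical):
SELECT kcu.column_name, kcu.ordinal_position
    FROM information_schema.table_constraints tc
        JOIN information_schema.key_column_usage kcu
            ON (tc.constraint_schema, tc.constraint_name)
             = (kcu.constraint_schema, kcu.constraint_name)
    WHERE tc.table_schema = 'public'
      AND tc.table_name = 'mytable'
      AND tc.constraint_type = 'PRIMARY KEY'
    ORDER BY kcu.ordinal_position;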
Table 37.30. key_column_usage
Columns
Column Type Description |
---|
Name of the database that contains the constraint (always the current database) |
Name of the schema that contains the constraint |
Name of the constraint |
Name of the database that contains the table that contains the column that is restricted by this constraint (always the current database) |
Name of the schema that contains the table that contains the column that is restricted by this constraint |
Name of the table that contains the column that is restricted by this constraint |
Name of the column that is restricted by this constraint |
Ordinal position of the column within the constraint key (count starts at 1) |
For a foreign-key constraint, ordinal position of the referenced column within its unique constraint (count starts at 1); otherwise null |
parameters
The view parameters
contains information about
the parameters (arguments) of all functions in the current database.
Only those functions are shown that the current user has access to
(by way of being the owner or having some privilege).
Table 37.31. parameters
Columns
Column Type Description |
---|
Name of the database containing the function (always the current database) |
Name of the schema containing the function |
The “specific name” of the function. See Section 37.45 for more information. |
Ordinal position of the parameter in the argument list of the function (count starts at 1) |
|
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Name of the parameter, or null if the parameter has no name |
Data type of the parameter, if it is a built-in type, or
|
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Always null, since this information is not applied to parameter data types in PostgreSQL |
Name of the database that the data type of the parameter is defined in (always the current database) |
Name of the schema that the data type of the parameter is defined in |
Name of the data type of the parameter |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Applies to a feature not available in PostgreSQL |
Always null, because arrays always have unlimited maximum cardinality in PostgreSQL |
An identifier of the data type descriptor of the parameter, unique among the data type descriptors pertaining to the function. This is mainly useful for joining with other instances of such identifiers. (The specific format of the identifier is not defined and not guaranteed to remain the same in future versions.) |
The default expression of the parameter, or null if none or if the function is not owned by a currently enabled role. |
referential_constraints
The view referential_constraints contains all referential (foreign key) constraints in the current database. Only those constraints are shown for which the current user has write access to the referencing table (by way of being the owner or having some privilege other than SELECT).
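For example, the foreign keys visible through this view, together with the constraint they reference and their update and delete rules, can be listed like this:
SELECT constraint_schema, constraint_name,
       unique_constraint_schema, unique_constraint_name,
       update_rule, delete_rule
    FROM information_schema.referential_constraints
    ORDER BY constraint_schema, constraint_name;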
Table 37.32. referential_constraints
Columns
Column Type Description |
---|
Name of the database containing the constraint (always the current database) |
Name of the schema containing the constraint |
Name of the constraint |
Name of the database that contains the unique or primary key constraint that the foreign key constraint references (always the current database) |
Name of the schema that contains the unique or primary key constraint that the foreign key constraint references |
Name of the unique or primary key constraint that the foreign key constraint references |
Match option of the foreign key constraint:
|
Update rule of the foreign key constraint:
|
Delete rule of the foreign key constraint:
|
role_column_grants
The view role_column_grants identifies all privileges granted on columns where the grantor or grantee is a currently enabled role. Further information can be found under column_privileges. The only effective difference between this view and column_privileges is that this view omits columns that have been made accessible to the current user by way of a grant to PUBLIC.
Table 37.33. role_column_grants
Columns
Column Type Description |
---|
Name of the role that granted the privilege |
Name of the role that the privilege was granted to |
Name of the database that contains the table that contains the column (always the current database) |
Name of the schema that contains the table that contains the column |
Name of the table that contains the column |
Name of the column |
Type of the privilege: |
|
role_routine_grants
The view role_routine_grants identifies all privileges granted on functions where the grantor or grantee is a currently enabled role. Further information can be found under routine_privileges. The only effective difference between this view and routine_privileges is that this view omits functions that have been made accessible to the current user by way of a grant to PUBLIC.
Table 37.34. role_routine_grants
Columns
Column Type Description |
---|
Name of the role that granted the privilege |
Name of the role that the privilege was granted to |
Name of the database containing the function (always the current database) |
Name of the schema containing the function |
The “specific name” of the function. See Section 37.45 for more information. |
Name of the database containing the function (always the current database) |
Name of the schema containing the function |
Name of the function (might be duplicated in case of overloading) |
Always |
|
role_table_grants
The view role_table_grants identifies all privileges granted on tables or views where the grantor or grantee is a currently enabled role. Further information can be found under table_privileges. The only effective difference between this view and table_privileges is that this view omits tables that have been made accessible to the current user by way of a grant to PUBLIC.
Table 37.35. role_table_grants
Columns
Column Type Description |
---|
Name of the role that granted the privilege |
Name of the role that the privilege was granted to |
Name of the database that contains the table (always the current database) |
Name of the schema that contains the table |
Name of the table |
Type of the privilege: |
|
In the SQL standard, |
role_udt_grants
The view role_udt_grants is intended to identify USAGE privileges granted on user-defined types where the grantor or grantee is a currently enabled role. Further information can be found under udt_privileges. The only effective difference between this view and udt_privileges is that this view omits objects that have been made accessible to the current user by way of a grant to PUBLIC. Since data types do not have real privileges in PostgreSQL, but only an implicit grant to PUBLIC, this view is empty.
Table 37.36. role_udt_grants
Columns
Column Type Description |
---|
The name of the role that granted the privilege |
The name of the role that the privilege was granted to |
Name of the database containing the type (always the current database) |
Name of the schema containing the type |
Name of the type |
Always |
|
role_usage_grants
The view role_usage_grants identifies USAGE privileges granted on various kinds of objects where the grantor or grantee is a currently enabled role. Further information can be found under usage_privileges. The only effective difference between this view and usage_privileges is that this view omits objects that have been made accessible to the current user by way of a grant to PUBLIC.
Table 37.37. role_usage_grants
Columns
Column Type Description |
---|
The name of the role that granted the privilege |
The name of the role that the privilege was granted to |
Name of the database containing the object (always the current database) |
Name of the schema containing the object, if applicable, else an empty string |
Name of the object |
|
Always |
|
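For instance, a query like the following could list the USAGE grants on sequences that are visible through this view (the grantee name app_user is a placeholder):

    SELECT object_schema, object_name, grantee, grantor, is_grantable
    FROM information_schema.role_usage_grants
    WHERE object_type = 'SEQUENCE'
      AND grantee = 'app_user';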
routine_column_usage
The view routine_column_usage identifies all columns that are used by a function or procedure, either in the SQL body or in parameter default expressions. (This only works for unquoted SQL bodies, not quoted bodies or functions in other languages.) A column is only included if its table is owned by a currently enabled role.

Table 37.38. routine_column_usage Columns

| Column | Type | Description |
|---|---|---|
| specific_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| specific_schema | sql_identifier | Name of the schema containing the function |
| specific_name | sql_identifier | The “specific name” of the function. See Section 37.45 for more information. |
| routine_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| routine_schema | sql_identifier | Name of the schema containing the function |
| routine_name | sql_identifier | Name of the function (might be duplicated in case of overloading) |
| table_catalog | sql_identifier | Name of the database that contains the table that is used by the function (always the current database) |
| table_schema | sql_identifier | Name of the schema that contains the table that is used by the function |
| table_name | sql_identifier | Name of the table that is used by the function |
| column_name | sql_identifier | Name of the column that is used by the function |
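As a sketch of how such entries arise (all object names here are invented for illustration, and the function must use an unquoted SQL-standard body as described above), the columns read by a function can afterwards be listed from this view:

    CREATE TABLE orders (id int, total numeric);

    CREATE FUNCTION order_total(o_id int) RETURNS numeric
    LANGUAGE SQL
    RETURN (SELECT total FROM orders WHERE id = o_id);

    SELECT routine_name, table_name, column_name
    FROM information_schema.routine_column_usage
    WHERE routine_name = 'order_total';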
routine_privileges
The view routine_privileges identifies all privileges granted on functions to a currently enabled role or by a currently enabled role. There is one row for each combination of function, grantor, and grantee.

Table 37.39. routine_privileges Columns

| Column | Type | Description |
|---|---|---|
| grantor | sql_identifier | Name of the role that granted the privilege |
| grantee | sql_identifier | Name of the role that the privilege was granted to |
| specific_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| specific_schema | sql_identifier | Name of the schema containing the function |
| specific_name | sql_identifier | The “specific name” of the function. See Section 37.45 for more information. |
| routine_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| routine_schema | sql_identifier | Name of the schema containing the function |
| routine_name | sql_identifier | Name of the function (might be duplicated in case of overloading) |
| privilege_type | character_data | Always EXECUTE |
| is_grantable | yes_or_no | YES if the privilege is grantable, NO if not |
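For example, to see who may execute a particular function, including rows for a grant to PUBLIC that role_routine_grants would omit (the schema and function names below are placeholders):

    SELECT grantee, privilege_type, is_grantable
    FROM information_schema.routine_privileges
    WHERE routine_schema = 'public'
      AND routine_name = 'order_total';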
routine_routine_usage
The view routine_routine_usage identifies all functions or procedures that are used by another (or the same) function or procedure, either in the SQL body or in parameter default expressions. (This only works for unquoted SQL bodies, not quoted bodies or functions in other languages.) An entry is included here only if the used function is owned by a currently enabled role. (There is no such restriction on the using function.)

Note that the entries for both functions in the view refer to the “specific” name of the routine, even though the column names are used in a way that is inconsistent with other information schema views about routines. This is per SQL standard, although it is arguably a misdesign. See Section 37.45 for more information about specific names.

Table 37.40. routine_routine_usage Columns

| Column | Type | Description |
|---|---|---|
| specific_catalog | sql_identifier | Name of the database containing the using function (always the current database) |
| specific_schema | sql_identifier | Name of the schema containing the using function |
| specific_name | sql_identifier | The “specific name” of the using function. |
| routine_catalog | sql_identifier | Name of the database that contains the function that is used by the first function (always the current database) |
| routine_schema | sql_identifier | Name of the schema that contains the function that is used by the first function |
| routine_name | sql_identifier | The “specific name” of the function that is used by the first function. |
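Because both sides of an entry are reported by specific name, resolving them to ordinary function names typically means joining routines twice, roughly as in this sketch:

    SELECT caller.routine_name AS calling_function,
           callee.routine_name AS called_function
    FROM information_schema.routine_routine_usage AS rru
         JOIN information_schema.routines AS caller
           ON caller.specific_schema = rru.specific_schema
          AND caller.specific_name = rru.specific_name
         JOIN information_schema.routines AS callee
           ON callee.specific_schema = rru.routine_schema
          AND callee.specific_name = rru.routine_name;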
routine_sequence_usage
The view routine_sequence_usage identifies all sequences that are used by a function or procedure, either in the SQL body or in parameter default expressions. (This only works for unquoted SQL bodies, not quoted bodies or functions in other languages.) A sequence is only included if that sequence is owned by a currently enabled role.

Table 37.41. routine_sequence_usage Columns

| Column | Type | Description |
|---|---|---|
| specific_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| specific_schema | sql_identifier | Name of the schema containing the function |
| specific_name | sql_identifier | The “specific name” of the function. See Section 37.45 for more information. |
| routine_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| routine_schema | sql_identifier | Name of the schema containing the function |
| routine_name | sql_identifier | Name of the function (might be duplicated in case of overloading) |
| sequence_catalog | sql_identifier | Name of the database that contains the sequence that is used by the function (always the current database) |
| sequence_schema | sql_identifier | Name of the schema that contains the sequence that is used by the function |
| sequence_name | sql_identifier | Name of the sequence that is used by the function |
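As an illustrative sketch (object names invented, and again assuming an unquoted SQL-standard body so that the dependency is recorded), a sequence referenced in a function body can be expected to show up here:

    CREATE SEQUENCE invoice_seq;

    CREATE FUNCTION next_invoice_no() RETURNS bigint
    LANGUAGE SQL
    RETURN nextval('invoice_seq');

    SELECT routine_name, sequence_schema, sequence_name
    FROM information_schema.routine_sequence_usage
    WHERE routine_name = 'next_invoice_no';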
routine_table_usage
The view routine_table_usage is meant to identify all tables that are used by a function or procedure. This information is currently not tracked by PostgreSQL.

Table 37.42. routine_table_usage Columns

| Column | Type | Description |
|---|---|---|
| specific_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| specific_schema | sql_identifier | Name of the schema containing the function |
| specific_name | sql_identifier | The “specific name” of the function. See Section 37.45 for more information. |
| routine_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| routine_schema | sql_identifier | Name of the schema containing the function |
| routine_name | sql_identifier | Name of the function (might be duplicated in case of overloading) |
| table_catalog | sql_identifier | Name of the database that contains the table that is used by the function (always the current database) |
| table_schema | sql_identifier | Name of the schema that contains the table that is used by the function |
| table_name | sql_identifier | Name of the table that is used by the function |
routines
The view routines contains all functions and procedures in the current database. Only those functions and procedures are shown that the current user has access to (by way of being the owner or having some privilege).

Table 37.43. routines Columns

| Column | Type | Description |
|---|---|---|
| specific_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| specific_schema | sql_identifier | Name of the schema containing the function |
| specific_name | sql_identifier | The “specific name” of the function. This is a name that uniquely identifies the function in the schema, even if the real name of the function is overloaded. The format of the specific name is not defined, it should only be used to compare it to other instances of specific routine names. |
| routine_catalog | sql_identifier | Name of the database containing the function (always the current database) |
| routine_schema | sql_identifier | Name of the schema containing the function |
| routine_name | sql_identifier | Name of the function (might be duplicated in case of overloading) |
| routine_type | character_data | FUNCTION for a function, PROCEDURE for a procedure |
| module_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| module_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| module_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| udt_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| udt_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| udt_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| data_type | character_data | Return data type of the function, if it is a built-in type, or ARRAY if it is some array (in that case, see the view element_types), else USER-DEFINED (in that case, the type is identified in type_udt_name and associated columns). Null for a procedure. |
| character_maximum_length | cardinal_number | Always null, since this information is not applied to return data types in PostgreSQL |
| character_octet_length | cardinal_number | Always null, since this information is not applied to return data types in PostgreSQL |
| character_set_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| character_set_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| character_set_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| collation_catalog | sql_identifier | Always null, since this information is not applied to return data types in PostgreSQL |
| collation_schema | sql_identifier | Always null, since this information is not applied to return data types in PostgreSQL |
| collation_name | sql_identifier | Always null, since this information is not applied to return data types in PostgreSQL |
| numeric_precision | cardinal_number | Always null, since this information is not applied to return data types in PostgreSQL |
| numeric_precision_radix | cardinal_number | Always null, since this information is not applied to return data types in PostgreSQL |
| numeric_scale | cardinal_number | Always null, since this information is not applied to return data types in PostgreSQL |
| datetime_precision | cardinal_number | Always null, since this information is not applied to return data types in PostgreSQL |
| interval_type | character_data | Always null, since this information is not applied to return data types in PostgreSQL |
| interval_precision | cardinal_number | Always null, since this information is not applied to return data types in PostgreSQL |
| type_udt_catalog | sql_identifier | Name of the database that the return data type of the function is defined in (always the current database). Null for a procedure. |
| type_udt_schema | sql_identifier | Name of the schema that the return data type of the function is defined in. Null for a procedure. |
| type_udt_name | sql_identifier | Name of the return data type of the function. Null for a procedure. |
| scope_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| scope_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| scope_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| maximum_cardinality | cardinal_number | Always null, because arrays always have unlimited maximum cardinality in PostgreSQL |
| dtd_identifier | sql_identifier | An identifier of the data type descriptor of the return data type of this function, unique among the data type descriptors pertaining to the function. This is mainly useful for joining with other instances of such identifiers. (The specific format of the identifier is not defined and not guaranteed to remain the same in future versions.) |
| routine_body | character_data | If the function is an SQL function, then SQL, else EXTERNAL |
| routine_definition | character_data | The source text of the function (null if the function is not owned by a currently enabled role). (According to the SQL standard, this column is only applicable if routine_body is SQL, but in PostgreSQL it contains whatever source text was specified when the function was created.) |
| external_name | character_data | If this function is a C function, then the external name (link symbol) of the function; else null. (This works out to be the same value that is shown in routine_definition.) |
| external_language | character_data | The language the function is written in |
| parameter_style | character_data | Always GENERAL |
| is_deterministic | yes_or_no | If the function is declared immutable (called deterministic in the SQL standard), then YES, else NO |
| sql_data_access | character_data | Always MODIFIES |
| is_null_call | yes_or_no | If the function automatically returns null if any of its arguments are null, then YES, else NO |
| sql_path | character_data | Applies to a feature not available in PostgreSQL |
| schema_level_routine | yes_or_no | Always YES |
| max_dynamic_result_sets | cardinal_number | Applies to a feature not available in PostgreSQL |
| is_user_defined_cast | yes_or_no | Applies to a feature not available in PostgreSQL |
| is_implicitly_invocable | yes_or_no | Applies to a feature not available in PostgreSQL |
| security_type | character_data | If the function runs with the privileges of the current user, then INVOKER; if the function runs with the privileges of the user who defined it, then DEFINER |
| to_sql_specific_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| to_sql_specific_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| to_sql_specific_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| as_locator | yes_or_no | Applies to a feature not available in PostgreSQL |
| created | time_stamp | Applies to a feature not available in PostgreSQL |
| last_altered | time_stamp | Applies to a feature not available in PostgreSQL |
| new_savepoint_level | yes_or_no | Applies to a feature not available in PostgreSQL |
| is_udt_dependent | yes_or_no | Currently always NO |
| result_cast_from_data_type | character_data | Applies to a feature not available in PostgreSQL |
| result_cast_as_locator | yes_or_no | Applies to a feature not available in PostgreSQL |
| result_cast_char_max_length | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_char_octet_length | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_char_set_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_char_set_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_char_set_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_collation_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_collation_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_collation_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_numeric_precision | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_numeric_precision_radix | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_numeric_scale | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_datetime_precision | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_interval_type | character_data | Applies to a feature not available in PostgreSQL |
| result_cast_interval_precision | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_type_udt_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_type_udt_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_type_udt_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_scope_catalog | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_scope_schema | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_scope_name | sql_identifier | Applies to a feature not available in PostgreSQL |
| result_cast_maximum_cardinality | cardinal_number | Applies to a feature not available in PostgreSQL |
| result_cast_dtd_identifier | sql_identifier | Applies to a feature not available in PostgreSQL |
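For example, a query along these lines could give an overview of the functions and procedures visible to the current user outside the system schemas, together with the language they are written in:

    SELECT routine_schema, routine_name, routine_type,
           external_language, security_type
    FROM information_schema.routines
    WHERE routine_schema NOT IN ('pg_catalog', 'information_schema')
    ORDER BY routine_schema, routine_name;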
schemata
The view schemata