Franck Pachot for MongoDB

Posted on Jul 8 • Edited on Jul 21

ALTER TABLE ... ADD COLUMN

#postgres #mongodb #database #performance

MongoDB’s flexible schema allows each document within a collection to have a unique structure, a feature known as polymorphism. This contrasts with SQL databases, where every row in a table must adhere to a predefined set of columns. To support polymorphic data models without multiplying tables, SQL schemas often include many optional columns, which are frequently left null. These null values usually signify "not applicable" rather than "unknown". Furthermore, unused columns must be explicitly defined in the database schema, and modifying this structure typically requires locking the table.

This rigidity is often a point of comparison, favoring MongoDB, where applications can introduce new schemas without affecting existing objects. A common example was the ALTER TABLE ADD COLUMN statement, which locks the table because DDL operations must modify the catalog information shared by all table rows. It was often mentionned to illustrate the rigidity of RDBMS. When updating existing rows, this lock could last a long time, causing significant delays.
However, while many ALTER TABLE operations still require rewriting the table, such as changing a data type, adding a column is no longer one of them. Most SQL databases now optimize adding a column as a metadata-only operation, making it faster and more efficient than before. The main difference with MongoDB is how this change is controlled by the database administrator or the application developer.

This blog post explains that mentioning ALTER TABLE ADD COLUMN to showcase schema flexibility is not ideal because it has been optimized in many RDBMS. It is the occasion to explain how it works internally in PostgreSQL, and that it is similar to what developers do with MongoDB.

Example on PostgreSQL

Adding a column requires an exclusive lock, but since PostgreSQL 11 this lock duration is now very short (if it doesn't have itself to wait on another lock).

For example, I created a table with ten million rows:

\timing on
Timing is on.

create table iloveddl ( id bigint );
CREATE TABLE
Time: 2.026 ms

insert into iloveddl select generate_series(1,1e7);
INSERT 0 10000000
Time: 31328.019 ms (00:31.328)

PostgreSQL updates are more costly than inserts, so rewriting a table can take several minutes. I use this to verify that my ALTER TABLE statements do not trigger a re-write of the rows.

Here is the first row in the table:

select * from iloveddl where id <2;
 id 
----
  1
(1 row)

Time: 839.856 ms

The column information is stored in the catalog and can be viewed in the pg_attribute table. To retrieve details such as the column name, number, nullability, and nullable or missing value, you can run the following SQL query:

select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
 from pg_attribute
 where attrelid='iloveddl'::regclass
;                                         

 attname  | attnum | attnotnull | atthasdef | atthasmissing | attmissingval 
----------+--------+------------+-----------+---------------+---------------
 tableoid |     -6 | t          | f         | f             | 
 cmax     |     -5 | t          | f         | f             | 
 xmax     |     -4 | t          | f         | f             | 
 cmin     |     -3 | t          | f         | f             | 
 xmin     |     -2 | t          | f         | f             | 
 ctid     |     -1 | t          | f         | f             | 
 id       |      1 | f          | f         | f             | 
(7 rows)

Time: 0.569 ms

In addition to the system columns, there is an entry for the single column I defined in CREATE TABLE: "id". All flags are set to false: this column is nullable, has no default value, and lacks a value to replace a missing one. In SQL databases, a missing value, indicating no value exists, is different from a null, which signifies an unknown value. The missing value flag is used to optimize adding columns with default values without needing to rewrite all rows.

NULL with no DEFAULT

To add a nullable column without a default value in PostgreSQL, the following SQL command adds a new column named "x" of type integer to the "iloveddl" table. Since no default value is specified and the column is nullable by default, it will contain NULL for existing rows:

alter table iloveddl add column x int;

ALTER TABLE
Time: 1.760 ms

This was a quick operation, involving only metadata, without rewriting the rows. Existing rows have no information about this new column, but on query, PostgreSQL adds this column to the result with a null value:

select * from iloveddl where id <2;

 id | x 
----+---
  1 |  
(1 row)

Time: 206.647 ms

This is a common scenario where the SQL semantics are straightforward: when a new column is added, it logically exists for all rows, but its value is unknown for those created before the column addition. Unknown values are represented as NULL. Prior to the ALTER TABLE command, the column did not exist. Afterward, it exists with an unknown value.

Although the ALTER TABLE ADD COLUMN operation affects all rows logically, it doesn't need to modify physical rows. Instead, the database catalog marks the existence of the new column for all rows, and at runtime, a physical non-existence is interpreted as a logical NULL.

PostgreSQL has simply added the column in the dictionary, as nullable (attnotnull):

select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
 from pg_attribute
 where attrelid='iloveddl'::regclass
;  

 attname  | attnum | attnotnull | atthasdef | atthasmissing | attmissingval 
----------+--------+------------+-----------+---------------+---------------
 tableoid |     -6 | t          | f         | f             | 
 cmax     |     -5 | t          | f         | f             | 
 xmax     |     -4 | t          | f         | f             | 
 cmin     |     -3 | t          | f         | f             | 
 xmin     |     -2 | t          | f         | f             | 
 ctid     |     -1 | t          | f         | f             | 
 id       |      1 | f          | f         | f             | 
 x        |      2 | f          | f         | f             | 
(8 rows)

Time: 0.510 ms

With a simple insert into the catalog, PostgreSQL changed the definition of all rows, from inexisting "x" to existing with unknown value.
For example, in a small business scenario, you may have stored only customers' names and emails. If you then decide to add a 'date of birth' column, this information was likely always existing but previously unrecorded. After the addition, the date of birth for existing customers will appear as NULL, indicating that the value is unknown.

NULL with DEFAULT value

SQL allows an insert to omit a column, setting its value to NULL by default, which indicates the value is unknown at insert time. However, SQL developers can specify that omitted columns should instead take a default value. The column remains nullable, meaning it can be explicitly set to NULL in an insert statement, but the absence of a value is different than a null assignment, and must set the default value.

When adding a column with a default value in PostgreSQL, existing rows are treated as if they were inserted on the table with the new column, but unspecified by the insert statement. The new column must now return the default value for the existing rows. Before PostgreSQL 11, the ALTER TABLE command had to write this default value into every row, which could be slow. This limitation was used to illustrate the rigidity of schema changes in SQL databases.

Currently, this particular case is optimized and performs quickly in the latest versions:

alter table iloveddl add column y int null default 42;

ALTER TABLE
Time: 2.802 ms

The default value used for future inserts is stored in the pg_attrdef table. You can retrieve this information with the following SQL query:

select * from pg_attrdef 
 where adrelid='iloveddl'::regclass
;

  oid   | adrelid | adnum |                                                                        adbin                                                      

--------+---------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------
 346434 |  346431 |     3 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :___location -1 :constvalue 4 [ 42 0 0 0 0 0 0 0 ]}
(1 row)

Time: 0.413 ms

This is used for future inserts, but existing rows show the same value, which was virtually set in the catalog to avoid re-writing all rows during the ALTER TABLE statement:

select * from iloveddl where id <2;

 id | x | y  
----+---+----
  1 |   | 42
(1 row)

Time: 189.657 ms

You can change the default value for future inserts without altering the existing data or the current value of the column. Here is an example:

alter table iloveddl alter column y set default 100;

ALTER TABLE
Time: 2.039 ms

insert into iloveddl(id) values (-1);

INSERT 0 1
Time: 1.516 ms

select * from iloveddl where id <2;

 id | x |  y  
----+---+-----
  1 |   |  42
 -1 |   | 100
(2 rows)

Time: 207.727 ms

This change affects only new rows. For example, after setting the default to 100, inserting a row without specifying "y" will automatically assign 100. It does not alter existing rows. For instance, the rows that existed before adding the column will still show their previous "y" values, like 42.

The default value has been changed in pg_attrdef, to 100, which proves that the value for existing rows, 42, is stored elewhere:

select * from pg_attrdef 
 where adrelid='iloveddl'::regclass
;

  oid   | adrelid | adnum |                                                                         adbin                                                                         
--------+---------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------
 346435 |  346431 |     3 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :___location -1 :constvalue 4 [ 100 0 0 0 0 0 0 0 ]}
(1 row)

Time: 0.568 ms

PostgreSQL appears to set columns with default values as if all rows have that value, but this is done logically rather than physically. It stores extra information in pg_attribute to assign a default value for columns that are missing when reading a table row, ensuring consistent behavior without physically altering each row:

select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
 from pg_attribute
 where attrelid='iloveddl'::regclass
;  

 attname  | attnum | attnotnull | atthasdef | atthasmissing | attmissingval 
----------+--------+------------+-----------+---------------+---------------
 tableoid |     -6 | t          | f         | f             | 
 cmax     |     -5 | t          | f         | f             | 
 xmax     |     -4 | t          | f         | f             | 
 cmin     |     -3 | t          | f         | f             | 
 xmin     |     -2 | t          | f         | f             | 
 ctid     |     -1 | t          | f         | f             | 
 id       |      1 | f          | f         | f             | 
 x        |      2 | f          | f         | f             | 
 y        |      3 | f          | t         | t             | {42}
(9 rows)

Time: 0.529 ms

When PostgreSQL reads a row from a table and finds that the column "y" does not exist, it checks the null bitmap. If "y" is absent there as well, PostgreSQL adds the column to the result with the default value stored in the catalog as attmissingval.

This optimization is only applicable when the default value is a constant. For scenarios like adding an expiration date to passwords—such as forcing users to change their password annually—it makes sense to set a default for existing users to the next year from the current date. This works only if there's the same expiration date for all users.

NOT NULL

It is possible to add a non-nullable column, but a default value is required. Otherwise, it would result in a null in a non-nullable column:

alter table iloveddl add column z int not null
;

ERROR:  column "z" of relation "iloveddl" contains null values
Time: 1.024 ms

alter table iloveddl add column z int not null default 42
;

ALTER TABLE
Time: 2.322 ms

This was quick, a metadata-only change. With NOT NULL DEFAULT, either the value is set physically in the row, or it is absent and the value comes from the catalog:

select * from pg_attrdef 
where adrelid='iloveddl'::regclass
;

  oid   | adrelid | adnum |                                                                         adbin                                                                         
--------+---------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------
 346435 |  346431 |     3 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :___location -1 :constvalue 4 [ 100 0 0 0 0 0 0 0 ]}
 346436 |  346431 |     4 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :___location -1 :constvalue 4 [ 42 0 0 0 0 0 0 0 ]}
(2 rows)

Time: 0.503 ms

select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
 from pg_attribute
 where attrelid='iloveddl'::regclass
;

 attname  | attnum | attnotnull | atthasdef | atthasmissing | attmissingval 
----------+--------+------------+-----------+---------------+---------------
 tableoid |     -6 | t          | f         | f             | 
 cmax     |     -5 | t          | f         | f             | 
 xmax     |     -4 | t          | f         | f             | 
 cmin     |     -3 | t          | f         | f             | 
 xmin     |     -2 | t          | f         | f             | 
 ctid     |     -1 | t          | f         | f             | 
 id       |      1 | f          | f         | f             | 
 x        |      2 | f          | f         | f             | 
 y        |      3 | f          | t         | t             | {42}
 z        |      4 | t          | t         | t             | {42}
(10 rows)

Time: 0.465 ms

If I insert a new row without mentionning the new column, it is set with the current default value from pg_attrdef, and stored with it. If I query a row that was inserted before, it shows the value from pg_attribute.attmissingval:

insert into iloveddl(id,y) values (-2,null)
;
INSERT 0 1
Time: 8.692 ms

select * from iloveddl where id <2
;
 id | x |  y  | z  
----+---+-----+----
  1 |   |  42 | 42
 -1 |   | 100 | 42
 -2 |   |     | 42
(3 rows)

Time: 195.648 ms

Summary of PostgreSQL behavior

The DEFAULT clause used in an ALTER TABLE ADD COLUMN statement serves two different purposes:

Schema on write: When new rows are inserted without specifying a value for this column, the DEFAULT value is automatically assigned, functioning similarly to a trigger or a stored generated column.
Schema on read: When querying rows that lack a value in this column, the DEFAULT value appears in the result set, similar to a view or a virtual column.

In PostgreSQL, there are three cases when reading a row. First, if a column is present in the null bitmap and flagged as null, the value returned is NULL. Second, if the column is present and flagged as not null, the actual value is returned. Lastly, if the column is not present in the null bitmap, the system returns a predefined value called attmissingval.

Comparison with MongoDB

In MongoDB, some part of the document structure can be defined using indexes and schema validation (schema-on-write). However, the application is free to add new fields to documents without impacting other objects, and interpret it in the application code (schema-on-read).

When a default value is needed, it is the application's responsibility to write it, rather than relying on a rule deployed in the database. If the absence of a field should be interpreted as another value, the application handles this with application code. In general, the logic is more complex than a single constant for all existing data, and may depend on other fields. Aggregation pipeline can code the logic so that it runs efficiently in the database.

I create a collection similar to my PostgreSQL table:

for (let i = 1; i <= 10_000_000; i += 1000)
 db.iloveddl.insertMany(Array.from(
  {length: 1000},
  (_,k) => ({_id: i+k})
 ))
;

Here is how the first document looks like:

db.iloveddl.find({_id:{$lt:2}})

[ { _id: 1 } ]

The equivalent of ADD COLUMN NULL with no default is simply defining the field for new documents when needed, with nothing to do on existing documents because in MongoDB an inexisting field is the same as null.

New documents can set the field and if there is a default value, the application will set it:

db.iloveddl.insertOne({ _id: -1, y: 100 })  

{ acknowledged: true, insertedId: -1 }

db.iloveddl.find({_id:{$lt:2}})

[ { _id: -1, y: 100 }, { _id: 1 } ]

SQL was designed primarily for non-programmers using the command line to manipulate data. In this context, it is important to deploy defaults and check constraints directly into the database. MongoDB is designed for databases where data manipulation is handled through application code, and developers prefer to centralize all logic within the application, where it can be reviewed and tested more easily. Typically, the document inserted into MongoDB comes from an application object, and defaults were set in the constructor.

The equivalent of ADD COLUMN NULL with DEFAULT is like in PostgreSQL, adding the field on read, except that it is done by the application code rather than a declaration in the database catalog:

db.iloveddl.aggregate([
  { $sort: { _id:1 } },
  { $project: {
       _id: 1,
        y: { $ifNull: [ "$y", 42 ] }
  } }
])

[
  { _id: -1, y: 100 }, { _id: 1, y: 42 },
  { _id: 2, y: 42 },   { _id: 3, y: 42 },
  { _id: 4, y: 42 },   { _id: 5, y: 42 },
  { _id: 6, y: 42 },   { _id: 7, y: 42 },
  { _id: 8, y: 42 },   { _id: 9, y: 42 },
  { _id: 10, y: 42 },  { _id: 11, y: 42 },
  { _id: 12, y: 42 },  { _id: 13, y: 42 },
  { _id: 14, y: 42 },  { _id: 15, y: 42 },
  { _id: 16, y: 42 },  { _id: 17, y: 42 },
  { _id: 18, y: 42 },  { _id: 19, y: 42 }
]
Type "it" for more

If the logic for existing documents is more complex, like depending on other fields, it can be done the same way:

db.iloveddl.aggregate([  
  { $sort: { _id: 1 } },  
  { $project: {  
      _id: 1,  
      y: {  
        $ifNull: [  
          "$y",  
          {  
            $cond: [  
              { $eq: [ { $mod: [ "$_id", 2 ] }, 0 ] },   
              42,  // set 42 to "y" when "_id" is even
              99   // set 42 to "y" when "_id" is odd
            ]  
          }  
        ]  
      }  
    }  
  }  
])

Like in PostgreSQL, such schema-on-read transformation can be implemented in the database as a view:

db.createView(  
  "iloveddl_conditional_y",   // Name of the view  
  "iloveddl",                 // Source collection  
  [  
    { $sort: { _id: 1 } },  
    { $project: {   
        _id: 1,  
        y: {  
          $ifNull: [  
            "$y",  
            {  
              $cond: [  
                { $eq: [ { $mod: [ "$_id", 2 ] }, 0 ] },  
                42,     // _id is even => 42  
                99      // _id is odd  => 99  
              ]  
            }  
          ]  
        }  
      }  
    }  
  ]  
)  

{ ok: 1 }

db.iloveddl_conditional_y.find();
[
  { _id: -1, y: 100 }, { _id: 1, y: 99 },
  { _id: 2, y: 42 },   { _id: 3, y: 99 },
  { _id: 4, y: 42 },   { _id: 5, y: 99 },
  { _id: 6, y: 42 },   { _id: 7, y: 99 },
  { _id: 8, y: 42 },   { _id: 9, y: 99 },
  { _id: 10, y: 42 },  { _id: 11, y: 99 },
  { _id: 12, y: 42 },  { _id: 13, y: 99 },
  { _id: 14, y: 42 },  { _id: 15, y: 99 },
  { _id: 16, y: 42 },  { _id: 17, y: 99 },
  { _id: 18, y: 42 },  { _id: 19, y: 99 }
]
Type "it" for more

Finally, if it is preferable to set the field in all existing documents, a simple update can be used:

db.iloveddl.updateMany(  
  { y: { $exists: false } },  
  [  
    { $set: {  
        y: {  
          $cond: [  
            { $eq: [ { $mod: [ "$_id", 2 ] }, 0 ] },  
                42,     // _id is even => 42  
                99      // _id is odd  => 99  
          ]  
        }  
      }  
    }  
  ]  
)  

{
  acknowledged: true,
  insertedId: null,
  matchedCount: 10000000,
  modifiedCount: 10000000,
  upsertedCount: 0
}

db.iloveddl.find().sort({_id:1});

[
  { _id: -1, y: 100 }, { _id: 1, y: 99 },
  { _id: 2, y: 42 },   { _id: 3, y: 99 },
  { _id: 4, y: 42 },   { _id: 5, y: 99 },
  { _id: 6, y: 42 },   { _id: 7, y: 99 },
  { _id: 8, y: 42 },   { _id: 9, y: 99 },
  { _id: 10, y: 42 },  { _id: 11, y: 99 },
  { _id: 12, y: 42 },  { _id: 13, y: 99 },
  { _id: 14, y: 42 },  { _id: 15, y: 99 },
  { _id: 16, y: 42 },  { _id: 17, y: 99 },
  { _id: 18, y: 42 },  { _id: 19, y: 99 }
]
Type "it" for more

Unlike PostgreSQL, which copies the entire row and updates all indexes when writing, MongoDB adds the field in memory and writes the documents back to disk only during checkpoints. This approach streamlines updates by avoiding full row copying and defers disk writes, potentially improving performance. Such update is typically designed by the application team, synchronized with application released, so that the field exists before the new application version uses it. With RDBMS, such schema modification often requires synchronization between teams, like developers and DBAs.

Unlike SQL, where the DEFAULT clause specify a value used to read existing rows and to write future rows, MongoDB handles these differently. About the NOT NULL constraint, you can enforce the presence of a field in future documents using schema validation:

db.runCommand({
  collMod: "iloveddl",
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: [ "z" ],
      properties: {
        z: { bsonType: "int" }
      }
    }
  },
  validationLevel: "strict"
})

In summary, the overall approach across databases emphasizes minimizing unnecessary modifications to existing documents when desired and possible. The key difference between SQL databases and MongoDB lies in how this logic is implemented and deployed. SQL databases like PostgreSQL require updating a central schema shared across all table rows, ensuring the same logical structure for all rows in a table. In contrast, MongoDB integrates the logic within the application itself, and follows the same software development practices, like coding, review, and testing in a single team.

DEV Community