query optimizations, quick recipes

Daniel Molina Wegener — Fri, 02 Jul 2010 13:13:24 +0000

Optimizing SQL queries is quite simple task, you just need to follow some rules on how you sort the filtering criteria and how you use your select statements. I will try to explain how you can optimize your queries in this post.

named columns

Use named columns, instead of using the * wildcard. For example:

select
    t.*
from
    tbl as t
where
    t.tab_date > current_date - interval 1 month
order by
    t.tab_date desc;

Will be more slower than:

select
    t.tab_id,
    t.tab_state,
    t.tab_date,
    t.tab_name
from
    tbl as t
where
    t.tab_date > current_date - interval 1 month
order by
    t.tab_date desc

join filtering

If you filter on join, your query will be faster, for example:

select
    t1.tab_id,
    coalesce(t2.tab_date, current_date) as tab_date,
    t1.tab_name
from
    table1 as t1
        left join table2 as t2 on (t1.tab_id = t2.tab_tab_id)
where
    t2.tab_date IS NULL or t2.tab_date > current_date - interval 1 month;

Will be more slower than

select
    t1.tab_id,
    coalesce(t2.tab_date, current_date) as tab_date,
    t1.tab_name
from
    table1 as t1
        left join table2 as t2 on (t1.tab_id = t2.tab_tab_id
            and t2.tab_date > current_date - interval 1 month);

smaller set first

The smaller set must selected first, for example:

select
    t1.tab_id,
    coalesce(t2.tab_date, current_date) as tab_date,
    t1.tab_name
from
    table1 as t1
        left join table2 as t2 on (t2.tab_tab_id = t1.tab_id
            and t2.tab_date > current_date - interval 1 month);

Will be more slower than:

select
    t1.tab_id,
    coalesce(t2.tab_date, current_date) as tab_date,
    t1.tab_name
from
    table1 as t1
        left join table2 as t2 on (t1.tab_id = t2.tab_tab_id
            and t2.tab_date > current_date - interval 1 month);

Since t1 has less rows than t2, and t2 is referencing t1. Assuming that t1 has a smaller set of rows and is the referenced table. This applies to column selection too and where statements.

notes

You must measure the time of your queries, so you can have a more precise query. Those rules are not absolute.

© Daniel Molina Wegener for coder . cl, 2010. | Permalink | No comment | Add to del.icio.us
Post tags:

Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

database usage rules

Daniel Molina Wegener — Fri, 23 Apr 2010 14:39:55 +0000

Modelling databases is not a simple task. Many times we found some misspelled words, mixed case words and some horrifying things on many databases of third party systems. My rules are simple to follow, and most of them ensures a well designed database, with special care on how do we query that database, creating reliable and optimised for speed table design and queries.

table design

You must follow a simple rule, fixed length data types goes first, from the one that fits with CPU register length to those types that have variable length. For example a a simple table design for MySQL could be as follows:

CREATE TABLE customer
(
    customer_id bigint(11) NOT NULL AUTO_INCREMENT,
    cus_priority int NOT NULL,
    cus_state int NOT NULL,
    cus_products int NOT NULL,
    cus_date datetime NOT NULL,
    cus_name varchar(100) NOT NULL,
    cus_email varchar(50) NOT NULL,
    PRIMARY KEY USING BTREE (customer_id)
)
ENGINE= InnoDB DEFAULT CHARSET= utf8

If you take a look on the column order, fixed length table keys have the first priority, fixed length data types have the second priority and the last priority on the table creation script is for those types with variable length. From the perspective of the RDBMS system that we are using, the table structure is read the first time that we query the table, then each fixed length column is read in sequence as the RDBMS systems reads each row from the disk or memory, hence each non-fixed length column is read consequently each time that a row is handled, reading the column length first and then skipping the column offset on the disk or memory to handle the next column. So, reading non-fixed length columns is hard for the RDBMS.

Imagine those tables created with variable length keys!, those tables are creating a heavy overhead on any kind of table created. So, try to use table keys of fixed length, such as int and bigint. Including those tables which have a fixed length made of char(n), are creating a heavy overhead, since those tables are using string comparison algorithms instead of memory block comparison, such as comparing integer types, which is more easy to handle.

Using a naming convention matters too. The table above has a lowercase name, lower case table fields and each field has a prefix related to the table. Why does they have a prefix?, while we are querying a table is more comfortable to identify them by the prefix that they have, specially when we are creating joined queries, allowing unique column identifiers.

query optimisation

Some older RDBMS systems do not support the usage of the join keyword, so we must create joined queries using the where statement. But nowdays in most RDBMS systems supports the join keyword, allowing more faster queries, so there is a technique allowing us to create joins from the older way of using where statements.

select
    pro.product_id,
    pro.pro_name,
    sum(coalesce(pay.pay_price, 0.0)) as payment
from
    product as pro
        left join subscription as sus on (pro.product_id = sus.product_id)
        left join payment as pay on (sus.payment_id = pay.payment_id)
group by
    pro.product_id,
    pro.pro_name
order by
    pro.product_id,
    pro.pro_name

The query above will bring us a report about payments related to each product on the database. If you see there is no where statement. Also, examine the join conditionals, the smallest dataset is filtered at the left side of the conditional. The more we filter on the join conditionals, faster is the query result. That query took just 0.002 seconds. Let’s see what happens with a non-joined query and using the where statement.

select
    pro.product_id,
    pro.pro_name,
    sum(coalesce(pay.pay_price, 0.0)) as payments
from
    product as pro,
    subscription as sus,
    payment as pay
where
    (pro.product_id = sus.product_id and sus.product_id is not null)
    and (pro.product_id = sus.product_id or sus.product_id is null)
    and (sus.payment_id = pay.payment_id or pay.payment_id is null)
group by
    pro.product_id,
    pro.pro_name
order by
    pro.product_id,
    pro.pro_name

Using the where statement, the same query took 0.012 seconds. What happened behind on the RDBMS side? On the first query, row filtering was made while the RDBMS was reading rows from each table, hence on the second query, the selection was made over the complete dataset retrieved from all selection tables. In this case we have an enhancement of 600% on performance, on more complex queries the same happens. Also the having clause is evil too, since it process those rows retrieved after the selection process on those rows retrieved on at last.


select
    pro.product_id,
    pro.pro_name,
    sum(coalesce(pay.pay_price, 0.0)) as payment
from
    product as pro
        left join subscription as sus on (pro.product_id = sus.product_id)
        left join payment as pay on (sus.payment_id = pay.payment_id)
group by
    pro.product_id,
    pro.pro_name
having
    payment > 200000
order by
    pro.product_id,
    pro.pro_name

The having clause can process aggregate columns, but that processing is made at end of the selection process, so you must try to replace that having clause by other join or where statement if you have an option to do that.

rdbms tuning

Each RDBMS has its own parameters. For example, MySQL have some parameters which allow you to have entire tables on memory, so most queries to those tables in memory are faster than reading the hard drive. If you have a system that is querying a database concurrently, you should increase the memory limit, allowing the RDBMS to maintain that table on memory. You must research which options fits for better on the RDBMS that you are using, which kind of keys and table format fits better with your needs. For example on MySQL InnoDB table format is not always the best option, since InnoDB require the highest resource usage, over other kind of tables available on MySQL RDBMS.