Modelling databases is not a simple task. We have often found misspelled words, mixed-case identifiers and other horrifying things in the databases of third-party systems. My rules are simple to follow, and most of them ensure a well-designed database, with special care for how we query it, producing table designs and queries that are reliable and optimised for speed.
table design
You must follow a simple rule: fixed-length data types go first, starting with the ones that fit a CPU register and ending with the variable-length types. For example, a simple table design for MySQL could be as follows:
CREATE TABLE customer (
    customer_id bigint NOT NULL AUTO_INCREMENT,
    cus_priority int NOT NULL,
    cus_state int NOT NULL,
    cus_products int NOT NULL,
    cus_date datetime NOT NULL,
    cus_name varchar(100) NOT NULL,
    cus_email varchar(50) NOT NULL,
    PRIMARY KEY USING BTREE (customer_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
If you look at the column order, fixed-length table keys have the first priority, the remaining fixed-length data types have the second priority, and the last place in the table creation script goes to the variable-length types. From the perspective of the RDBMS we are using, the table structure is read the first time we query the table; after that, each fixed-length column is read in sequence as the RDBMS reads each row from disk or memory. Each variable-length column, however, must be handled individually every time a row is read: the RDBMS reads the column length first and then skips that many bytes on disk or in memory to reach the next column. So reading variable-length columns is hard work for the RDBMS.
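As a counter-example (a hypothetical reordering of the table above, not something you should create), placing the variable-length columns first forces that length-and-skip work before any of the cheap fixed-length columns can be reached:

CREATE TABLE customer (
    cus_name varchar(100) NOT NULL, -- variable-length columns first: bad
    cus_email varchar(50) NOT NULL, -- every column below pays the length-and-skip cost
    customer_id bigint NOT NULL AUTO_INCREMENT,
    cus_priority int NOT NULL,
    cus_state int NOT NULL,
    cus_products int NOT NULL,
    cus_date datetime NOT NULL,
    PRIMARY KEY USING BTREE (customer_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8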
Imagine tables created with variable-length keys! Such keys impose a heavy overhead on every table that uses them. So try to use table keys of fixed length, such as int and bigint. Even keys of the fixed-length char(n) type create a heavy overhead, since comparing them requires string comparison algorithms instead of the memory block comparison used for integer types, which is much easier to handle.
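As an illustration (a hypothetical lookup table, not part of the schema above), compare a table keyed by a varchar code with the same table keyed by a bigint surrogate; every key lookup and join against the first one pays the string comparison cost:

CREATE TABLE country_bad (
    country_code varchar(40) NOT NULL, -- variable-length key: string comparison on every lookup
    cou_name varchar(100) NOT NULL,
    PRIMARY KEY USING BTREE (country_code)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE country (
    country_id bigint NOT NULL AUTO_INCREMENT, -- fixed-length key: memory block comparison
    cou_code char(2) NOT NULL,
    cou_name varchar(100) NOT NULL,
    PRIMARY KEY USING BTREE (country_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8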
Using a naming convention matters too. The table above has a lowercase name and lowercase fields, and each field has a prefix related to the table. Why do they have a prefix? While we are querying a table it is more comfortable to identify the columns by their prefix, especially when we are writing joined queries, since it gives us unique column identifiers.
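For example (the subscription columns here are hypothetical), the prefixes keep every selected column unambiguous in a joined query without qualifying each one:

select cus_name, cus_email, sus_date
from customer as cus
inner join subscription as sus on (cus.customer_id = sus.customer_id)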
query optimisation
Some older RDBMS systems do not support the join keyword, so we had to build joined queries using the where statement. Nowadays most RDBMS systems support the join keyword, allowing much faster queries, and there is a technique for turning the old where-based statements into joins.
select pro.product_id, pro.pro_name,
       sum(coalesce(pay.pay_price, 0.0)) as payment
from product as pro
left join subscription as sus on (pro.product_id = sus.product_id)
left join payment as pay on (sus.payment_id = pay.payment_id)
group by pro.product_id, pro.pro_name
order by pro.product_id, pro.pro_name
The query above gives us a report of the payments related to each product in the database. Notice that there is no where statement. Also examine the join conditions: the smallest dataset is filtered on the left side of each condition. The more we filter inside the join conditions, the faster the query returns. That query took just 0.002 seconds. Let's see what happens with a non-joined query that uses the where statement.
select pro.product_id, pro.pro_name,
       sum(coalesce(pay.pay_price, 0.0)) as payment
from product as pro, subscription as sus, payment as pay
where pro.product_id = sus.product_id
  and sus.payment_id = pay.payment_id
group by pro.product_id, pro.pro_name
order by pro.product_id, pro.pro_name
Using the where statement, the same query took 0.012 seconds. What happened behind the scenes on the RDBMS side? In the first query, row filtering was done while the RDBMS was reading rows from each table, whereas in the second query the selection was made over the complete dataset retrieved from all the tables involved. In this case we get a 600% performance improvement, and the same happens with more complex queries. The having clause is evil too, since it processes the retrieved rows only after the selection process, at the very end.
select pro.product_id, pro.pro_name,
       sum(coalesce(pay.pay_price, 0.0)) as payment
from product as pro
left join subscription as sus on (pro.product_id = sus.product_id)
left join payment as pay on (sus.payment_id = pay.payment_id)
group by pro.product_id, pro.pro_name
having payment > 200000
order by pro.product_id, pro.pro_name
The having clause can filter on aggregate columns, but that filtering happens at the end of the selection process, so you should try to replace a having clause with a join or where condition whenever you have the option to do so.
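When the condition does not involve an aggregate, the replacement is direct. For example (a hypothetical variation on the report above), a filter on pro_name belongs in the where statement, where rows are discarded as they are read, instead of in a having clause, where they are discarded only after all the grouping work is done:

select pro.product_id, pro.pro_name,
       sum(coalesce(pay.pay_price, 0.0)) as payment
from product as pro
left join subscription as sus on (pro.product_id = sus.product_id)
left join payment as pay on (sus.payment_id = pay.payment_id)
where pro.pro_name like 'A%'
group by pro.product_id, pro.pro_name
order by pro.product_id, pro.pro_name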
rdbms tuning
Each RDBMS has its own parameters. For example, MySQL has parameters that allow you to keep entire tables in memory, so most queries against those in-memory tables are faster than reading the hard drive. If you have a system that queries a database concurrently, you should increase the memory limits, allowing the RDBMS to keep those tables in memory. You must research which options fit best for the RDBMS you are using, and which kinds of keys and table formats fit your needs. For example, on MySQL the InnoDB table format is not always the best option, since InnoDB requires the most resources of all the table formats available in the MySQL RDBMS.
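As a minimal sketch of what that tuning looks like on MySQL (the values below are illustrative only, not recommendations; they must be matched to your workload and available RAM):

[mysqld]
innodb_buffer_pool_size = 1G  # cache for InnoDB data and indexes
key_buffer_size = 256M        # cache for MyISAM indexes
max_heap_table_size = 512M    # size limit for MEMORY engine tables
tmp_table_size = 512M         # size limit for in-memory temporary tables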