The Sietch - this is not a blog.
This is a no-blog.

Pseudorandom Quote:

Developers, developers, developers, developers!
Steve Ballmer

Many to Many Relationships

As one gets more experience as a developer the moments when new information can totally change understanding of basic things get rarer. Last month I was in an argument with three coworkers of mine and I was stomped so hard that I was forced to rethink and change the way I do something very basic namely the way I model many to many relationships.

In the past I just went for what I thought was the most natural approach. For example if I needed to model the relationship between users and groups where each User can be in many groups and each Group can have multiple users I would just create a cross table in the database where each row consists of UserID and GroupID. Both IDs together will be a primary key and they are both a foreign key to the corresponding tables. In my C# model I would put a collection of Groups in the User object and a collection of Users in the Group object. In the UI I would only provide a way to delete the relation and a way to add the relation but no edit. It is simple and it works. I can confirm it works because I have built multiple projects using this approach. However it turns out it is not as good as it sounds.

The Alternative

The alternative is to give the cross table its own ID, create a C# object representing the relationship and provide an option to edit the relationship in the UI. Why is this approach better?

Database Performance

Now I must admit I am a total noob when it comes to database performance. When people explain something I usually think I understand it and I agree but the problem is that if someone comes after that and explains the opposite I would probably agree again. When I was told the approach with the ID required less memory I agreed at once but did not agree that this was enough of a reason to add IDs to the cross tables. Turns out that because the primary key is an index you are actually better off having a separate ID because it is half the size and you will actually save memory due to smaller indexes. If you use the first approach you will have the benefit that the "left" column is a part of the index and it will result in faster search when this column is involved. However you can add this index and get that performance back. While at this point you are not saving memory you may gain performance. If the index is on two columns it gets bigger which means that it can spread on more pages. Algorithmically the search has the same complexity but because you will search half the pages you will get performance advantage as the table gets bigger. Of course I quickly dismissed these arguments as premature optimization.

User Interface

Next point I was arguing was that the interface becomes more complex when it had an edit option. The user would be confused what the difference was between delete and add on one side and edit on the other. We should simplify the UI by having less buttons. Again my arguments were destroyed. First of all line of business apps usually have consistent interface where each item has add, delete and edit buttons. By removing the edit button I was making the interface less consistent and therefore more complex. In addition when the normal user adds a relationship and makes a mistake he does not think "I want to remove the incorrect relationship", he thinks "I want to fix my error". The user wants to edit. The edit button does make sense.

Object Model

So I lost the UI argument and my last resort was the object model. In an OOP model many to many relationships are represented by having a collection of references to the corresponding type in both types. This is still true and if I am programming some algorithm or some framework I would probably use this representation by default. However in this case we are talking about persistence and UI. For our application the edit button was in the requirements anyway so we had to implement it. I was asked what I gain from going with the direct relationship approach. Getting the Groups a User belongs to would be easier. I guess the code for creating a relationship would be slightly simpler if I already had one side of the relationship (i.e. if you have the Group object and want to add a User). However in practice our SPA app was sending two IDs via AJAX to create a relationship. Creating a new relationship object in the alternative approach would not be much harder since I needed to get the User and Group objects anyway. If instead of objects the relationship object only used the IDs it would be even simpler. On the other hand what did I lose? First of all we had to create different code for editing relationships than what we had for everything else. This one had two IDs. Deleting a relationship also required two IDs all this code was quite different from all the other CRUD code that just used a single ID to do those operations. And what would happen if the requirements changed and a date on which the relationship was created was required? I would fall back to the approach with the ID and the separate object anyway. So basically the approach with the separate relationship object would be easier to implement and easier to change in the future. I lost on all accounts :(

One thing I was not convinced to do was rewrite the code that was tested and working in order to fit the new ideas I was just converted to. "Don't change things that work" is still something I strongly believe in.

The moral of the story is that what seems simpler at first is not always the better option.

This information comes to you thanks to my colleague Andrey Nikolov

Tags: english programming

Posted by: Stilgar

17:09 15.07.2014

Comments:

First Previous 1 Next Last

Posted by JOKe (Unregistered) on 18:12 15.07.2014

Dont agree. ( I hope I understood everything correctly ) BUT I STRONGLY DONT AGREE

First of all I dont give a shit what a designer says.. and the reason he wants EDIT of a relation in the interface and the fact that you need to support edit operation where 2 ids are passed instead of 1 is not a reason to make a crappy object model.

Second the object model... I dont care about the clients UI and the performance.. the clients UI and the performance is something I should think about after I have a great API.
The API should be exactly as you said .. in Users.. you will want a Set of Groups.. and in Group you should have a Set of Users... nothing more.. nothing fancy.. you should not have a Sperate UserToGroup entity and all calls to your API becomes a SHIT like
user.UserToGroups[i].Group ? wtf is this ?? UGLY SHIT thats what.

Third... the performance .. ha ha ... :( omg.. first of all USUALLY you will not LOOKUP by this dedicated PK ... usually you want a JOIN ... for example all groups that dont have users... all users that are in more then 2 groups.. and so on... by having a normal PK on the two FKs ... is the fastest way possible. In your example I would always write the same query and will not use the dedicated PK at all.. so why I will going to use it ?? for editing from the UI only (facepalm).

So.. I am strongly AGAINST peoples who are making MANY to MANY relation with more then 2 FKs on the table... It has always been done this way and there is a reason.

lastly.. the most optiomal way to create this "middle" table is to use COMPRESSed index... I am not sure all databases support such index... since most databases are joke (sql server there) .. but maybe the Auto-shrink option helps to SQL server to make it a little more usable.
The most optimal, FAST and MEMORY efficient way to define a MANY TO MANY relation looks like this :
CREATE TABLE USER_GROUP_REL (
USER_ID INT,
GROUP_ID INT,
PRIMARY KEY (USER_ID, GROUP_ID),
FOREIGN KEY (USER_ID) REFERENCES USER (USER_ID),
FOREIGN KEY (GROUP_ID) REFERENCES GROUP (GROUP_ID)
) ORGANIZATION INDEX COMPRESS;

CREATE INDEX USER_GROUP_REL_IE1 ON USER_GROUP_REL (USER_ID, GROUP_ID) COMPRESS;

Btw you can google more on the topic here :
http://dba.stackexchange.com/questions/15878/do-i-need-a-separate-id-column-for-this-mapping-table
here :
http://programmers.stackexchange.com/questions/182103/is-a-unique-id-column-needed-in-a-many-to-many-junction-table
and here
http://stackoverflow.com/questions/11768969/many-to-many-relationship-query-performance-with-large-table-sizes

P.S. stop writing crappy API because someone cannot optimize his database (dont have a normal DBA), cant create an UI( have a crappy designer) or cant write normal business requirements (have a crappy analyst )

Posted by Stilgar on 18:24 15.07.2014

First of all smaller index means more efficient index because it means fewer reads. This is why a separate index will be faster than a PK index. You won't query on the PK you will query on the column and it will still be faster because the index is stored in less pages.

Secondly as I pointed out in the post the API is not better if it is harder to use (i.e. the edit scenario) and harder to modify (adding additional data to the relation).

Sadly I have no idea what compressed indexes are.

Posted by JOKe (Unregistered) on 18:39 15.07.2014

about compressed index : https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::p11_question_id:8806017524595

Even if a PK index on 1 column is smaller as you said.. YOU WILL NEVER GONNA USE ITTTTTTTT... you will use it only by the crappy edit form .. where some user is editing something... which is so so so so so so so so so so so so so so not common.

the common case is to use this Many to many table in a JOIN.... and this PK... well you don't need it for the join :)

Posted by JOKe (Unregistered) on 18:47 15.07.2014

in this blog the time is very interesting... last post 15:39... I believe my comments live in the past.. :D

Posted by JOKe (Unregistered) on 18:48 15.07.2014

or my windows live in the future... it says 18:48...

where should I submit bug ? :D

Posted by Stilgar on 19:05 15.07.2014

oh fuck:( I moved it to another hosting and the way I deal with time sucks :)

It doesn't matter if you want to use the id (which you do want when you need an edit form) the index is still smaller (i.e. less pages) and will perform better. Also note that you will have only one query (edit) instead of two (delete/add)

Posted by Guest (Unregistered) on 22:20 15.07.2014

Sorry but saying "this is ugly shit" is a idiotic way to critique something. If you want a majestic API then by all means design one but that hardly matters whether you have access to the relation directly. (No not a relation in the Codd sense ).

Performance? What if you have deep object hierarchies on either side of the M2M? You going to pull all those back each time? Don't say lazy loading will work because not everyone is using 1 or 2 tiered architecture.

The relationship between the entities is a REAL one and there is no "universal law" it is to be hidden using technology.

Posted by JOKe (Unregistered) on 00:17 16.07.2014

Guest I dont see how adding the pk to this extra relation fixes the problem. Instead a set of users in a group you will have a set of groupedUsers let day each of them a many to one relation to a user.
Lazy loading helps since you dont need to pull all... all the time. Also even if you need to pull everythink all the time thats why things like second level cache exist and so on. So the thing is you will not always fetch all users, their groups and their.. adresses... andd phones.. and.. friends... andd messages and so on... and even so Why adding an extra entity solves anything? it will not make it easier to maintain and will not make it better in anyway.
And i guess you agree that the API will in fact be more crappy. I am not saying every api is a peace of art but you should give your best to make it.. and if you see something will make your model harder to understand and your api harder to use.. you probably are going on a very bad direction..

Posted by wqw (Unregistered) on 19:31 16.07.2014

@JOKe: Consider tables Orders, OrderDetails and Products. OrderDetails has an Order_ID REFERENCES Orders(ID) and Product_ID REFERENCES Products(ID) columns, so technically it's a cross table with some additional attributes for quantity, price, etc.

As far as I understand you insist on OrderDetails having a composite (natural) PK on (Order_ID, Product_ID) and that editing an order line is out of question? Is this some kind of joke or is this the right-and-only-"Oracle"-way of tormenting whimsical users?

Btw, in your sample why is the PK on (USER_ID, GROUP_ID) and not on (GROUP_ID, USER_ID)? I expect base table to be queried WHERE GROUP_ID=X as often as WHERE USER_ID=y

Posted by JOKe (Unregistered) on 11:26 18.07.2014

wqw: you are mixing what we talk about.
If you add additional attributes in the cross table.. like quantity, price and so on.. this is no longer just many to many relation so in this case sure.. you will need to edit it .. and sure you can add pk to make the edit easier (if it is easier for you).

What we are talking here is a plain many to many relations.. with NOTHING to edit except the FKs , and to have form just to edit this and to have a crappy API because of some strange PK...

So.. if you have something like ProductDetail .. or ProductEntry where you are linking a Product and a ShoppingCart.. and you need additional information like quantity, price.. maybe a discounts as well and so on sure you will add a PK but this is no longer just a MTM relation.

About the other question .. the order... yes the order meters for sure but perhaps not in the way you might expect.
Given USER_ID and GRUOP_ID, people's initial instinct is to put the more selective column first.

But, that's almost always irrelevant.

First, you need to consider how the index will be used in your SQL statements. If some SQLs only specify USER_ID, then USER_ID should probably be first. If all SQLs will always specify both columns in the where predicate, then it doesn't matter which comes first.

But, what if most of the SQLs specify both USER_ID and GROUP_ID , and a few provide only USER_ID or GROUP_ID was your question.

Well, in that case, the least selective column should (probably) go first. There are two reasons for this. First, if the least selective goes first, then, for those queries where only the second column is specified in the predicate, Oracle will be more likely to do an INDEX SKIP SCAN, if the leading column is not very selective. Second, by putting the less selective column first, you'll be more likely to be able to take advantage of index compression.

P.S. and YES.. I speak only about ORACLE... the other databases are here just for joke... to have something to play with... the fact that even Microsft SQL Server doesnt have Cluster support is so funny... and they sell this ? haha ... no idea why someone is buying it.

Posted by wqw (Unregistered) on 14:31 23.07.2014

@JOKe: Your experience with MSSQL is on par with mine on ElasticSearch -- little to non-existent.

Of course there is "Cluster support" in MSSQL. You can even cluster both on USER_ID and GROUP_ID so that both search predicates (USER_ID=x, GROUP_ID=y) are executed with a clustered index seek/scan.

What's more interesting is why you back-out on your original MTM tables firm stance? My experience is like 90% of cross tables in production have additional fields e.g. in our case Permissions field, or SequenceNo field, or auditing fields (LastUser, ModifiedOn). So now we can have PK in production, hmm... thank you for allowing us keep our sanity!

Posted by wqw (Unregistered) on 16:47 01.08.2014

... no idea what you are talking about...
compressed index ? or cluster support...cluster support I mean to start 10 instances of an SQL server and and to all work like a balancer I mean the load to be distributed.

about the compressed index.. Stilgar said there isnt .... yes I have little experience with MS SQL server, because I believe it is bad...

P.S. ElasticSearch RULZ ! :)

Posted by JOKe (Unregistered) on 16:48 01.08.2014

ops !! the last comment was from me .. to wqw...

:D:D: ops :D

Posted by Stilgar on 10:14 02.08.2014

ooops JOKe did it again...

Posted by Sanders (Unregistered) on 17:13 12.08.2014

Developers should not be involved in UX/UI creation process! They should implement.

Posted by Guest (Unregistered) on 16:11 17.08.2014

guys (Stilgar, JOKe), as developers you should avoid making decisions regarding database design. DB designers rarely (if at all) tell developers how to code. I realise that not all teams have good db designers and often (unfortunately) developers "design" DBs. I have to admit though that I tend to rather agree with JOKe's arguments (except the part about MSSQL). But all in all these are developer arguments. It would be interesting to hear (read) what an experienced DB designer has to say about all this.

Posted by Aleks (Unregistered) on 00:21 04.08.2017

Mostly agree with JOKe except the mssql thing, Sql Server is nice premium db.

Posted by Aleks (Unregistered) on 23:17 04.08.2017

surrogate key for no good reason is poor design:
-the surrogate key, which is primary is useless when it comes to find user or group by group or user
-if you need fast access to get all users with a certain group, you still need additional index and if you need fast access on all groups for a user you still need additional index.

Another very important thing to consider is if you will need that surrogate key in another table, in other words is there going to be something else in the db that needs reference for that relationship and specifically that surrogate key, and if that is true, then you have a case why you need it, otherwise it is useless and that sql developer should be fired.