Devart Blog

Data Comparison Methods Overview

Posted by on March 9th, 2010

Data comparison is a difficult and resource-intensive process. For convenience, this process can be divided into several steps.
First, you should compare tables from one database on one server with the database on the other server. You should choose columns for data comparison, and also choose a column that will be a comparison key.
The next step is to choose all data from these tables or some specified part of the data.
The third and the most important step is comparison of the two tables by the selected comparison key itself. During this process the status of each record is set to “only in source”, “only in target”, “different”, or “equal”.
The final steps of the data comparison process are including records to the synchronization and synchronization itself. During these steps records needed for synchronization are chosen, update script is created, and after that the script is executed.
You can read a detailed description of the comparison process here.

Now let’s look at the third step (data comparison) thoroughly.

There are several ways of data comparison that differ only by the side where data comparison is going to be performed – on the server side or on the client PC.

Data comparison on the server side is performed using the resources of the server.
The algorithm of comparison is the following:
1. For each record of each of the two tables its checksum is calculated;
2. Then the checksum of every record from one table is compared to the checksum of the corresponding record from another table and conclusion if the records are equal or different is made;
3. The comparison result is stored in a temporary table on the server.

Performance indicators:
1. The speed of data comparison directly depends on the server capacity and occupancy;
2. The maximal size of database for comparison is limited by the resources of the server itself.

Advantages:
1. There is no need to transfer large amounts of data for comparison to the client PC through network. This way we save network traffic;
2. The speed of comparison does not depend on the client PC resources;
3. Ability to compare blob data of any size.

Disadvantages:
1. Because of the record checksum calculation algorithm in some cases different data can result in equal checksum, and instead of the expected “different” status the “equal” status will be received;
2. There is no flexibility in the synchronization and comparison options usage;
3. There is no possibility to view records differences and exclude a part of the records from the synchronization manually;
4. During the synchronization script creation you should perform data transfer from the server to the client side;
5. The control checksum calculation of a large amount of records consumes all server resources;
6. One should provide extra space on the server for the comparison results storage in the temporary table.

As we can see, this way of comparison has more disadvantages than advantages, that’s why this way is rarely used.

Data comparison on the client PC is performed using the client machine resources, and the server only provides data for comparison. In turn, this way of comparison can be divided into several more ways depending on the way how comparison information will be stored.

Comparing Data on local PC when comparison result is stored in RAM.
The comparison algorithm is the following:
1. Server passes all data from both tables to the local PC;
2. Every record of every table is placed to RAM and is compared without checksum calculation;
3. If a record gets “only in source”, “only in target” or “equal” status, only comparison key is stored in RAM. If records get “different” status, they are placed to RAM for storage completely.

Performance indicators:
1. The speed of data comparison directly depends on the client PC resources and on the speed of data transfer through the network;
2. Maximum size of the database for comparison is limited by the size of RAM on the client PC, and this maximum size also depends on the degree to which the databases that should be compared are different – the smaller is the amount of different records, the larger databases can be compared.

Advantages:
1. Minimal server occupancy – server performs only simple data selection;
2. The simplest algorithm of data comparison because records are sorted on the client side;
3. Flexibility in the comparison options usage;
4. Minimal size of the comparison data store;
5. Status of every record for any data is always correct.

Disadvantages:
1. To view records with “only in source”, “only in target”, or “equal” status an extra data selection is needed;
2. An extra data selection is needed to create a synchronization script;
3. OutOfMemory Exception may be arisen when there are a lot of differences in data in databases;
4. Possibility to compare blob data only of the size that equals to the size of free RAM.

This way of comparison is implemented in dbForge Data Compare for SQL Server v1.10,
dbForge Data Compare for MySQL v2.00
and allows to compare databases of any size if data in these databases does not differ a lot.

Comparing Data on local PC when comparison result is stored as a cashed file on the disk.
The algorithm of comparison is the following:
The server passes all data from both tables sorted by comparison key to a local PC. Data is read by bytes, compared without checksum calculation and written to a file on the disk.

Performance indicators:
1. The speed of data comparison directly depends on the client PC resources and on the speed of data transferring through the network;
2. The maximum size of a database to compare is limited by free disk space and does not depend on the degree of data difference in databases.

Advantages:
1. Medium server occupancy – server performs data sorting and selection;
2. To view records and synchronization script creation extra requests to the server are not necessary;
3. The status for every record is always correct for any data;
4. Possibility to compare blob data of the size equal to the size of free space available on the disk.

Disadvantages:
1. Difficult algorithm of data comparison for the records comparison key of which is of the string data type;
2. Difficult algorithm of disk cash for temporary information storage creation.

We can see that in this case the only disadvantage of this way of comparison is the difficulty of implementation. There are more advantages than in the ways of comparison listed above. That’s why this way of comparison will be used in the new version of dbForge Data Compare for SQL Server v2.00 and
dbForge Data Compare for MySQL v3.00 for data comparison.

Leave a Reply