DCT HASH MATCHING QUALITY FOR RESIZED IMAGES 2

pHash performs its mathematical operations on every pixel at the original image size. Therefore, when the image is resized, the result differs slightly depending on the image size. My assumption was that if every image bigger than a certain size were resized down to that size before hashing, the general matching quality would be better.

I tested with the same set of image samples as in the previous posting; however, because of the processing speed, the comparison was performed on 3,644 images.

To find which size is good for normalization, I resized images to widths of 2000, 1500, and 1000 pixels, then computed the hamming distance between each normalized image and its versions scaled from 90% down to 10%.
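As a rough illustration of the normalization step, the sketch below caps an image at 2000 pixels wide before hashing. It assumes pHash's ph_dct_imagehash() is available and shells out to ImageMagick's convert; the threshold, temporary path, and shell-out approach are placeholders of mine, not the exact code used for this test.

// Sketch: cap the width at 2000px, then compute the 64-bit DCT hash.
// Assumes pHash (ph_dct_imagehash) and ImageMagick (convert) are installed.
#include <cstdio>
#include <cstdlib>
#include <pHash.h>

int main(int argc, char *argv[])
{
    if (argc < 2) { fprintf(stderr, "usage: %s image\n", argv[0]); return 1; }

    // ImageMagick's '>' geometry flag shrinks the image only when it is
    // wider than 2000px; smaller images are left untouched.
    char cmd[1024];
    snprintf(cmd, sizeof(cmd),
             "convert '%s' -resize '2000>' /tmp/normalized.jpg", argv[1]);
    if (system(cmd) != 0) return 1;

    ulong64 hash = 0;
    if (ph_dct_imagehash("/tmp/normalized.jpg", hash) < 0) return 1;
    printf("%016llX\n", (unsigned long long)hash);
    return 0;
}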

 

Hamming Distance is bigger than 4

normalization size 2000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          1
90%     0      0          0          0          6          17
80%     0      0          0          1          12         19
70%     0      0          0          1          18         36
60%     0      0          0          12         48         87
50%     0      0          3          26         77         141
40%     0      0          9          62         172        272
30%     1      12         54         156        333        475
20%     27     99         246        424        693        851
10%     163    360        753        1093       1442       1636

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.01
90%     0.00   0.00       0.00       0.00       0.08       0.23
80%     0.00   0.00       0.00       0.01       0.16       0.25
70%     0.00   0.00       0.00       0.01       0.24       0.48
60%     0.00   0.00       0.00       0.16       0.64       1.16
50%     0.00   0.00       0.04       0.35       1.03       1.89
40%     0.00   0.00       0.12       0.83       2.30       3.64
30%     0.01   0.16       0.72       2.09       4.45       6.35
20%     0.36   1.32       3.29       5.67       9.27       11.38
10%     2.18   4.82       10.07      14.62      19.29      21.89

normalization size 1500

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          1
90%     0      0          0          0          2          13
80%     0      0          0          0          7          14
70%     0      0          0          1          15         33
60%     0      0          0          2          25         64
50%     0      0          0          7          46         110
40%     0      0          4          25         123        223
30%     0      0          18         86         247        389
20%     6      27         116        257        520        678
10%     137    308        654        969        1313       1507

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.01
90%     0.00   0.00       0.00       0.00       0.03       0.17
80%     0.00   0.00       0.00       0.00       0.09       0.19
70%     0.00   0.00       0.00       0.01       0.20       0.44
60%     0.00   0.00       0.00       0.03       0.33       0.86
50%     0.00   0.00       0.00       0.09       0.62       1.47
40%     0.00   0.00       0.05       0.33       1.65       2.98
30%     0.00   0.00       0.24       1.15       3.30       5.20
20%     0.08   0.36       1.55       3.44       6.96       9.07
10%     1.83   4.12       8.75       12.96      17.57      20.16

normalization size 1000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          1
90%     0      0          0          0          0          11
80%     0      0          0          0          0          7
70%     0      0          0          0          5          23
60%     0      0          0          0          6          45
50%     0      0          0          0          26         90
40%     0      0          0          3          56         156
30%     0      0          2          17         132        274
20%     0      4          39         122        354        512
10%     61     161        406        679        999        1193

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.01
90%     0.00   0.00       0.00       0.00       0.00       0.15
80%     0.00   0.00       0.00       0.00       0.00       0.09
70%     0.00   0.00       0.00       0.00       0.07       0.31
60%     0.00   0.00       0.00       0.00       0.08       0.60
50%     0.00   0.00       0.00       0.00       0.35       1.20
40%     0.00   0.00       0.00       0.04       0.75       2.09
30%     0.00   0.00       0.03       0.23       1.77       3.67
20%     0.00   0.05       0.52       1.63       4.74       6.85
10%     0.82   2.15       5.43       9.08       13.36      15.96

Hamming Distance is bigger than 6

normalization size 2000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          0
90%     0      0          0          0          0          1
80%     0      0          0          1          2          3
70%     0      0          0          0          4          11
60%     0      0          0          0          8          21
50%     0      0          0          6          20         46
40%     0      0          4          21         46         94
30%     0      0          11         45         106        175
20%     4      14         63         142        286        381
10%     59     153        347        539        752        869

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.00
90%     0.00   0.00       0.00       0.00       0.00       0.01
80%     0.00   0.00       0.00       0.01       0.03       0.04
70%     0.00   0.00       0.00       0.00       0.05       0.15
60%     0.00   0.00       0.00       0.00       0.11       0.28
50%     0.00   0.00       0.00       0.08       0.27       0.62
40%     0.00   0.00       0.05       0.28       0.62       1.26
30%     0.00   0.00       0.15       0.60       1.42       2.34
20%     0.05   0.19       0.84       1.90       3.83       5.10
10%     0.79   2.05       4.64       7.21       10.06      11.63

normalization size 1500

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          0
90%     0      0          0          0          0          1
80%     0      0          0          0          0          1
70%     0      0          0          0          2          9
60%     0      0          0          0          6          19
50%     0      0          0          1          10         36
40%     0      0          0          8          28         76
30%     0      0          3          26         81         150
20%     1      4          30         88         221        316
10%     39     99         257        433        639        756

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.00
90%     0.00   0.00       0.00       0.00       0.00       0.01
80%     0.00   0.00       0.00       0.00       0.00       0.01
70%     0.00   0.00       0.00       0.00       0.03       0.12
60%     0.00   0.00       0.00       0.00       0.08       0.25
50%     0.00   0.00       0.00       0.01       0.13       0.48
40%     0.00   0.00       0.00       0.11       0.37       1.02
30%     0.00   0.00       0.04       0.35       1.08       2.01
20%     0.01   0.05       0.40       1.18       2.96       4.23
10%     0.52   1.32       3.44       5.79       8.55       10.11

normalization size 1000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          0
90%     0      0          0          0          0          1
80%     0      0          0          0          0          1
70%     0      0          0          0          1          8
60%     0      0          0          0          2          15
50%     0      0          0          0          9          35
40%     0      0          0          0          14         62
30%     0      0          0          4          35         104
20%     0      0          11         39         138        233
10%     14     38         135        270        449        566

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.00
90%     0.00   0.00       0.00       0.00       0.00       0.01
80%     0.00   0.00       0.00       0.00       0.00       0.01
70%     0.00   0.00       0.00       0.00       0.01       0.11
60%     0.00   0.00       0.00       0.00       0.03       0.20
50%     0.00   0.00       0.00       0.00       0.12       0.47
40%     0.00   0.00       0.00       0.00       0.19       0.83
30%     0.00   0.00       0.00       0.05       0.47       1.39
20%     0.00   0.00       0.15       0.52       1.85       3.12
10%     0.19   0.51       1.81       3.61       6.01       7.57

 

 

Conclusion

According to the test results, in terms of matching percentage, resizing before hashing gives better results, so it can be a solution for better matching. However, the false positive matching percentage is also important and still needs to be examined.

DCT Hash matching quality for resized images

The DCT hash in pHash was selected as the image similarity search algorithm for Creative Commons image license search. Recently, we found that some images are not matched after they are resized, so I tested this with flickr CC images.

First, I resized each image to 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% of its original size. Each resized image was hashed, and the hamming distance from the 100% image's hash was calculated. Since image size matters, I categorized the images by original width: bigger than 5000 pixels, 4000~5000, 3000~4000, 2000~3000, 1000~2000, and smaller than 1000 pixels.

The total image count was 7,475.
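For reference, each comparison behind the tables below boils down to something like this sketch (the file names are placeholders; ph_dct_imagehash() and ph_hamming_distance() are the pHash calls involved):

// Sketch: hash the 100% image and one resized copy, then compare.
#include <cstdio>
#include <pHash.h>

int main()
{
    ulong64 hashOriginal = 0, hashResized = 0;

    // Hash the original (100%) image and a resized copy of it.
    if (ph_dct_imagehash("original_100.jpg", hashOriginal) < 0) return 1;
    if (ph_dct_imagehash("resized_50.jpg", hashResized) < 0) return 1;

    // Number of differing bits between the two 64-bit DCT hashes;
    // the tables below count images where this exceeds 4 (or 6).
    int distance = ph_hamming_distance(hashOriginal, hashResized);
    printf("hamming distance = %d\n", distance);
    return 0;
}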

Count of images whose hamming distance is bigger than 4

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 5 21 31 39 53 71
80% 8 28 46 58 80 97
70% 13 32 60 85 128 172
60% 23 71 123 170 244 322
50% 30 97 173 246 359 491
40% 65 182 344 490 712 908
30% 125 339 626 861 1217 1519
20% 236 577 1079 1472 2012 2349
10% 505 1080 1983 2698 3419 3823

Percentage of images whose hamming distance is bigger than 4

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0.07 0.28 0.41 0.52 0.71 0.95
80% 0.11 0.37 0.62 0.78 1.07 1.30
70% 0.17 0.43 0.80 1.14 1.71 2.30
60% 0.31 0.95 1.65 2.27 3.26 4.31
50% 0.40 1.30 2.31 3.29 4.80 6.57
40% 0.87 2.43 4.60 6.56 9.53 12.15
30% 1.67 4.54 8.37 11.52 16.28 20.32
20% 3.16 7.72 14.43 19.69 26.92 31.42
10% 6.76 14.45 26.53 36.09 45.74 51.14

Count of images whose hamming distance is bigger than 6

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0 1 1 1 2 4
80% 2 3 7 10 12 17
70% 4 6 14 20 27 38
60% 6 13 23 35 50 76
50% 11 22 42 59 83 129
40% 19 58 99 140 207 297
30% 27 102 195 286 425 577
20% 79 227 441 612 896 1091
10% 249 579 1064 1475 1907 2159

Percentage of images whose hamming distance is bigger than 6

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0.00 0.01 0.01 0.01 0.03 0.05
80% 0.03 0.04 0.09 0.13 0.16 0.23
70% 0.05 0.08 0.19 0.27 0.36 0.51
60% 0.08 0.17 0.31 0.47 0.67 1.02
50% 0.15 0.29 0.56 0.79 1.11 1.73
40% 0.25 0.78 1.32 1.87 2.77 3.97
30% 0.36 1.36 2.61 3.83 5.69 7.72
20% 1.06 3.04 5.90 8.19 11.99 14.60
10% 3.33 7.75 14.23 19.73 25.51 28.88

 
 

Conclusion

The results show that when an image is resized, some images can no longer be detected. A possible solution is to resize any image bigger than a certain size down to that size before hashing. I tested normalization widths of 2000, 1500, and 1000 pixels.

64bit unsigned long long type transfer between Javascript and C++ Daemon

Currently, the APIs that add and match image licenses receive a pHash value extracted from the image. This hash value is a 64-bit binary value; for fast processing, the database and the C++ daemon treat it as an unsigned long long. However, while Anna was recently developing the JavaScript pHash module, a problem appeared: when the JavaScript calculation printed the output hash value, the last 4 or 5 characters were wrong. That is because the maximum exactly-representable integer in JavaScript is 2^53.

  • Max value of integer in Javascript :
    2^53 : 9007199254740992 : 0x20000000000000

  • Max value of unsigned long long :
    2^64 - 1 : 18446744073709551615 : 0xFFFFFFFFFFFFFFFF
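The effect is easy to reproduce in C++ by converting a 64-bit hash to a double, which is the same IEEE-754 format JavaScript uses for numbers (a demonstration added here for illustration, not project code):

// Demonstration: above 2^53, adjacent integers collapse in a double,
// so the last digits of a 64-bit hash come out wrong in JavaScript.
#include <cstdio>
#include <cstdint>

int main()
{
    uint64_t hash = 0xFFFFFFFFFFFFFFFFULL; // a full 64-bit hash value
    double asJsNumber = (double)hash;      // what a JavaScript number stores

    printf("exact : %llu\n", (unsigned long long)hash); // 18446744073709551615
    printf("double: %.0f\n", asJsNumber);               // 18446744073709551616
    return 0;
}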

There are two solutions:

  1. Using Big integer library like http://silentmatt.com/biginteger/
  2. Using Hexadecimal String for output

The first solution has the benefit that other modules do not have to be changed. The second solution's benefit is that it does not need an additional Javascript library.
We decided to use solution 2, because:

  1. the hash value is only ever sent to the php API page
  2. no calculation needs to be done on it
  3. later, when another hash algorithm is used, the hash can be much longer
  4. an additional Javascript library would make client implementations slower.

After adopting this solution, the following modules were affected.

  • javascript : added code to change from binary string to hexadecimal string

  • phash : hash generator from image
    I changed the code from generating integer string to generating hexadecimal string.

//printf("%llu\n", tmphash);
printf("%016llX\n", tmphash);
  • hamming : hamming distance calculator from two hash values
    I changed it to get hexadecimal string :
//    ulong64 hash1 = strtoull(argv[1], NULL, 10);
//    ulong64 hash2 = strtoull(argv[2], NULL, 10);
    ulong64 hash1 = strtoull(argv[1], NULL, 16);
    ulong64 hash2 = strtoull(argv[2], NULL, 16);
  • regdaemon : C++ daemon
    I changed add/match command so it gets hexadecimal string.
//uint64_t uiHash = std::stoull(strHash);
uint64_t uiHash = std::stoull(strHash, 0, 16);
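Putting the pieces together, a quick round-trip check (my own sanity test, not project code) shows that the hexadecimal string preserves all 64 bits:

// Round trip: the hex string printed by the patched phash parses back
// to the identical 64-bit value in hamming/regdaemon, so nothing is lost.
#include <cstdio>
#include <cstdlib>

typedef unsigned long long ulong64;

int main()
{
    ulong64 original = 0xABCDEF0123456789ULL;        // a sample 64-bit hash

    char hex[17];
    snprintf(hex, sizeof(hex), "%016llX", original); // what phash now prints

    ulong64 parsed = strtoull(hex, NULL, 16);        // what hamming/regdaemon read
    printf("%s -> %llu (lossless: %s)\n", hex, parsed,
           parsed == original ? "yes" : "no");
    return 0;
}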

The php API does not have to be changed because it simply passes the value through, base64 encoded.

For the MySQL database field, we decided to keep the 64-bit unsigned integer type for the DCT hash value, because then the value does not have to be converted from a string to a number when it is loaded into memory for indexing.

libstdc++.so.6 library mismatch problem and solution

Problem

When I tried to run an executable that had been built on another machine, it showed the following error :

$ ./regdaemon
./regdaemon: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./regdaemon)

 

Analysis

The reason for this error was that the dynamically linked library libstdc++.so.6 was an older version than the one used on the build machine.

On the build machine, the library looks like this:

/usr/lib/x86_64-linux-gnu$ ll libstdc*
lrwxrwxrwx 1 root root      19 Nov  4  2014 libstdc++.so.6 -> libstdc++.so.6.0.20
-rw-r--r-- 1 root root 1011824 Nov  4  2014 libstdc++.so.6.0.20

This means that the library actually used by the executable is libstdc++.so.6.0.20, and libstdc++.so.6 links to it. This library is installed with a newer gcc.

On the machine that showed the error, the library looked like this:

/usr/lib/x86_64-linux-gnu $ ll libstdc*
lrwxrwxrwx 1 root root     19 May 14 14:11 libstdc++.so.6 -> libstdc++.so.6.0.19
-rw-r--r-- 1 root root 979056 May 14 14:41 libstdc++.so.6.0.19

libstdc++.so.6 links to libstdc++.so.6.0.19, which is an older version than the one on the build machine.

 

Solution

Since the machine was running Linux Mint, which is Ubuntu-based, a newer gcc can be installed with the following commands :

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.9

Then the library is updated like this :

/usr/lib/x86_64-linux-gnu $ ll libstdc*
lrwxrwxrwx 1 root root      19 Apr 23 13:00 libstdc++.so.6 -> libstdc++.so.6.0.21
-rw-r--r-- 1 root root 1541600 Apr 23 13:23 libstdc++.so.6.0.21

Now, because the installed library was newer than the one on the build machine, the executable worked well.

The other solution would be linking statically, by adding the -static-libstdc++ (and -static-libgcc) options at build time.

Additional information

Which files (regular files, sockets, etc.) are opened by a process can be seen using the lsof utility.

hosung@hosung-Spectre:~$ lsof -p 6002
COMMAND    PID   USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
regdaemon 6002 hosung  cwd    DIR                8,2     4096 2589221 /home/hosung/cdot/ccl/regdaemon/Debug
regdaemon 6002 hosung  rtd    DIR                8,2     4096       2 /
regdaemon 6002 hosung  txt    REG                8,2  1066943 2545008 /home/hosung/cdot/ccl/regdaemon/Debug/regdaemon
regdaemon 6002 hosung  mem    REG                8,2    47712 2117917 /lib/x86_64-linux-gnu/libnss_files-2.19.so
regdaemon 6002 hosung  mem    REG                8,2    14664 2117927 /lib/x86_64-linux-gnu/libdl-2.19.so
regdaemon 6002 hosung  mem    REG                8,2   100728 2101352 /lib/x86_64-linux-gnu/libz.so.1.2.8
regdaemon 6002 hosung  mem    REG                8,2  1071552 2117915 /lib/x86_64-linux-gnu/libm-2.19.so
regdaemon 6002 hosung  mem    REG                8,2  3355040 6921479 /usr/lib/x86_64-linux-gnu/libmysqlclient.so.18.0.0
regdaemon 6002 hosung  mem    REG                8,2  1840928 2117938 /lib/x86_64-linux-gnu/libc-2.19.so
regdaemon 6002 hosung  mem    REG                8,2    92504 2097171 /lib/x86_64-linux-gnu/libgcc_s.so.1
regdaemon 6002 hosung  mem    REG                8,2  1011824 6846284 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
regdaemon 6002 hosung  mem    REG                8,2  1112840 6830431 /usr/lib/libmysqlcppconn.so.7.1.1.3
regdaemon 6002 hosung  mem    REG                8,2   141574 2117939 /lib/x86_64-linux-gnu/libpthread-2.19.so
regdaemon 6002 hosung  mem    REG                8,2   149120 2117935 /lib/x86_64-linux-gnu/ld-2.19.so
regdaemon 6002 hosung    0u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    1u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    2u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    3u  IPv4              63342      0t0     TCP localhost:60563->localhost:mysql (ESTABLISHED)
regdaemon 6002 hosung    4u  unix 0x0000000000000000      0t0  114861 /tmp/cc.daemon.sock

CC Image License Search Engine API Implementation

(Architecture diagram: UI, PHP API, C++ daemon, and MySQL database)

UI

Previously, my colleague Anna made a page that searches for similar images by uploading a file or providing a link. This UI page can be either inside or outside the server. It uses only the PHP API, without accessing the database directly.
 

PHP API

This is an open API with functions for adding, deleting, and matching images. It can be accessed by anyone who wants these functions; the UI page and client implementations such as browser extensions use it. The matching result is in JSON format.
This API page adds/deletes/matches by asking the C++ daemon, without changing the database itself.
Only read-only access to the database is permitted.
 

C++ Daemon

All adding/deleting operation will be done in this daemon. By doing so, we can remove the problem of synchronization between database and index for matching. That is because this daemon will have content index on the memory all the time for fast matching.
Because this daemon is active all the time, to get the request and give result to “PHP API”, it works as domain socket server. PHP API will request using domain socket.
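A minimal sketch of that socket setup (an illustration under assumptions: the /tmp/cc.daemon.sock path matches the lsof output earlier in this document, and command parsing plus the in-memory index are omitted):

// Sketch: a Unix domain socket server handling one request per connection.
#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main()
{
    int server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server < 0) return 1;

    sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/cc.daemon.sock", sizeof(addr.sun_path) - 1);

    unlink(addr.sun_path);                        // remove a stale socket file
    if (bind(server, (sockaddr*)&addr, sizeof(addr)) < 0) return 1;
    listen(server, 5);

    for (;;) {
        int client = accept(server, NULL, NULL);
        char buf[256];
        ssize_t n = read(client, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            // dispatch the add/match/delete command here ...
            write(client, "OK\n", 3);
        }
        close(client);
    }
}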
 

MySQL

The database contains all metadata about the CC-licensed images, plus the thumbnail paths used to show previews in the matching results.

Pastec Test for Performance

So far, I have tested Pastec in terms of image matching quality. In this posting, I tested the speed of adding and searching.

Adding images to index

First I added 100 images, which took 48.339 seconds. Then I added all directories from 22 to 31; those images were uploaded to wikimedia commons from 2013.12.22 to 2013.12.31.

Directory Start End Duration Count Average
22 17:32:42 18:43:50 01:11:08 8785 00:00.49
23 18:43:50 19:42:03 00:58:13 7314 00:00.48
24 19:42:03 20:28:56 00:46:53 6001 00:00.47
25 20:28:57 21:28:02 00:59:05 7783 00:00.46
26 21:28:02 22:41:12 01:13:10 9300 00:00.47
27 22:41:19 23:54:28 01:13:09 9699 00:00.45
28 23:54:28 00:53:23 00:58:55 7912 00:00.45
29 00:53:23 02:27:42 01:34:19 11839 00:00.48
30 02:27:42 03:31:48 01:04:06 8827 00:00.44
31 03:31:48 04:23:15 00:51:27 6880 00:00.45

The average time for adding an image was around 0.46 seconds, and it did not increase as the index grew. Most of the time for adding an image is spent extracting features.
I saved the index files for 100 images, for directories 22 to 26, and for directories 22 to 31. The sizes were 8.7 MB, 444.1 MB, and 935.8 MB respectively.

 

Searching images

I loaded the index file for 100 images and searched for all 100 images that had been added.

Directory Duration Count Average
22 00:01:14 100 00:00.74

Searching took 1m14.781s. Since there were 100 images, the average time to search for one image was 0.74 seconds.

Then I loaded the index file containing the index for the 39,183 images in directories 22 to 26.

Directory Start End Duration Count Average
22 09:00:05 11:21:06 02:21:01 8785 00:00.96
23 11:21:06 13:13:52 01:52:46 7314 00:00.93
24 13:13:52 14:48:26 01:34:34 6001 00:00.95
25 14:48:26 16:48:44 02:00:18 7783 00:00.93
26 16:48:44 19:13:11 02:24:27 9300 00:00.93

This time, the average time for searching one image was 0.95 seconds.

Then I loaded the index file containing the index for the 84,340 images in directories 22 to 31.

Directory Start End Duration Count Average
22 19:32:54 22:44:09 03:11:15 8785 00:01.31
23 22:44:09 01:16:59 02:32:50 7314 00:01.25
24 01:16:59 03:24:52 02:07:53 6001 00:01.28
25 03:24:52 06:11:33 02:46:41 7783 00:01.28
26 06:11:33 09:30:53 03:19:20 9300 00:01.29

Searching was performed for the same images from directories 22 to 26. The average time for searching was 1.3 seconds.

Conclusion

  • Adding an image took about 0.47 seconds.
  • Adding time did not vary with index size.
  • Searching time for an image varied with index size.
  • When the index size was 100, 39,183, and 84,340 images, searching time was 0.74, 0.95, and 1.3 seconds, respectively.
    (Chart: searching time by index size; the y-axis is time in milliseconds.)
    Around 0.6 seconds of that is likely spent reading the image and extracting features, while the search itself increases in proportion to the size of the index.

scp/sftp through an SSH tunnel

SSH Tunneling

The machine CC can be connected to from another machine called zenit.
To scp to CC through zenit, the following command establishes an SSH tunnel to CC.

ssh -L 9999:[address of CC known to zenit]:22 [user at zenit]@[address of zenit]
in my case,
ssh -L 9999:10.10.10.123:22 user1234567@zenit.server.ca

Now port 9999 on localhost (127.0.0.1) is a tunnel to CC through zenit.
This session needs to stay alive for everything that follows.

 

SCP through the SSH Tunnel

Then these commands copy the local test.png file to CC:~/tmp, and copy CC:~/tmp/test.png back to the current directory.

scp -P 9999 test.png ccuser@127.0.0.1:~/tmp
scp -P 9999 ccuser@127.0.0.1:~/tmp/test.png .

 

Making it easy

Typing those long commands every time is not a good idea, so I added an alias to .bashrc.

alias ccturnnel='ssh -L 9999:10.10.10.123:22 user1234567@zenit.server.ca'

Then I wrote two simple bash scripts.

This is cpfromcc.

#!/bin/bash
# cpfromcc: copy a file from CC to the local machine through the tunnel.
# Map the local home directory to ~ so the path resolves on the remote side.
var=$(echo $1 | sed 's/\/home\/hosung/~/g')
remote=$var
scp -P 9999 ccuser@127.0.0.1:$remote $2

This is cptocc.

#!/bin/bash
# cptocc: copy one or more local files to CC through the tunnel.
# Every argument except the last is a source file; the last is the remote destination.
values=""
remote=""
i=1
for var in "$@"
do
    if [ $i -ne $# ]
    then
        values="$values $var"
    else
        # Map the local home directory to ~ so the path resolves on the remote side.
        var=$(echo $var | sed 's/\/home\/hosung/~/g')
        remote=$var
    fi
    ((i++))
done
scp -P 9999 $values ccuser@127.0.0.1:$remote

The reason I use sed on the remote path is that bash expands ~ to my local home directory.
Now I can establish the SSH tunnel by typing ccturnnel.
Then I can scp from my machine to CC using :

cptocc test.jpg test2.jpg ~

And I can do scp from CC to my machine using :

cpfromcc ~/remotefile.txt .

 

Making it convenient using sftp

When the tunnel is established, sftp works the same way; note that OpenSSH's sftp takes the port with -P.

$ sftp -P 9999 ccuser@127.0.0.1

 

Making it more convenient using Krusader

By typing sftp://ccuser@127.0.0.1:9999 in Krusader's URL bar, and then adding the place to the bookmarks, the remote machine's file system is easily accessed.


Mounting it using sshfs should also be possible.

Pastec Test for real image data

In the previous test of Pastec, I used 900 jpeg images that were mainly computer-generated. This time, I tested images from the WikiMedia Commons archive of CC-licensed images uploaded from 2013-12-25 to 2013-12-30. They are zip files of 17 GB to 41 GB, each containing around 10,000 files, including jpg, gif, png, tiff, ogg, pdf, djvu, svg, and webm. Before testing, I deleted the xml, pdf, djvu, and webm files, leaving 55,643 images.

Indexing

Indexing the 55,643 images took around 12 hours, and the index file was 622 MB. At first, I made separate index files for each day; however, Pastec can load only one index file, so I added all 6 days' images and saved them to one index file.

While indexing, there were some errors:

  1. Pastec uses OpenCV, and OpenCV does not support gif and svg, so these two formats could not be opened.
  2. Pastec only adds images bigger than 150×150 pixels.
  3. There were zero-byte images: 153 of the 55,643 files. The images are valid on the wikimedia web pages, but the zero-byte local files caused errors.
  4. One tiff image caused a crash inside Pastec; it needs debugging.

Searching

After loading the 622 MB index file, images can be searched. Searching for all 55,643 images took around 15 hours. Every search extracts features from the query image before searching, which is why searching takes more time.

Search result

Among the 55,643 images, 751 (1.43%) were smaller than 150×150, so they were not added. 51,479 images were a proper size and format for OpenCV; they were indexed and can be searched.

  • 42,931 (83%) images matched only themselves (exactly the same image)
  • 8,459 (15%) images matched more than one image
  • 90 (0.17%) images did not match any image, not even themselves.

Images that didn't match any image

These 90 images were properly indexed but did not match even themselves.

  • 55 images were png images with transparency; the rest of the cases below are jpg images
  • 14 images were long panorama images like the following

resize_BatoFerryBambangNVjf0019_01
resize_BatoFerryBambangNVjf0019_11
resize_Biblia_Wujka_1840_V_I_p_2_s_001
resize_Biblia_Wujka_1840_V_I_p_2_s_39
resize_Biblia_Wujka_1840_Vol._I_part_2._s._84
resize_Cyanate_resonance_structures
resize_LamutRiverBridgeCordilleraIfugaojf0529_01
resize_LamutRiverBridgeCordilleraIfugaojf0529_12
resize_LamutRiverBridgeCordilleraIfugaojf0529_13
resize_Quezon,NuevaVizcayaBridgejf0120_01
resize_Quezon,NuevaVizcayaBridgejf0120_11
resize_Quezon,NuevaVizcayaBridgejf0120_12
resize_Svilaja_panorama_from_the_top

  • 6 images were simple images like the following

__Amore_2013-12-30_14-18 __Bokeh_(9775121436) __Bokeh_(9775185973) __Hmm_2013-12-30_16-54 __Moon_early_morning_Shot

  • 8 vague images: images whose lines are not clear, or photographs that are out of focus

__20131229141153!Adrien_Ricorsse SONY DSC __Llyn_Alwen,_Conwy,_Cymru_Wales_21 __Minokaya_junior_high_school __Moseskogbunn __Nella_nebbia._Franco_Nero_e_Valeria_Vaiano_in_Mineurs_-_Minatori_e_minori SONY DSC SONY DSC

  • Other cases
    __Brännblåsa_på_fingret_2013-12-26_13-40 __Pottery_-_Sonkh_-_Showcase_6-15_-_Prehistory_and_Terracotta_Gallery_-_Government_Museum_-_Mathura_201d247a1ec8535aec4f9bf86066bd10dd
    These two images are a bit out of focus.

__Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake26 __Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake41 __Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake42

__Logo-151-151
The original size of this image is 150×150 pixels; maybe it is too small and simple.

Images matched with more than one image

8,459 images matched more than one image. To compare the results, I generated an html file that shows all match results, like the following:
(Screenshot: the generated match-result page)

I converted all images to 250×250 pixels using the convert -resize 250x250 filename command so they could be shown on one page. The html file size was 6.8 MB, and it shows 64,630 images.

As I mentioned in my previous blog, Pastec is good at detecting rotated/cropped images.
Almost all matches were reasonable (similar). The following are notable matches :
20131225102452!Petro_canada Petro_canada_info

20131225193901!Bitlendik Bitlendik-avatar

In these two cases, the logo was matched.

20131225212947!June_Allyson_Dick_Powell_1962 June_Allyson_Dick_Powell_1962 Aveiro_342

This match looks like a false positive.

Buddhapanditsabhacandapuri Aveiro_144

This match is also a false positive.

NZ-SH45_map NZ-SH30_map NZ-SH38_map

In this case, the map is shifted.

PK-LJH_(Boeing_737-9GP)_from_Lion_Airlines_(9722507259) Blick über die Saalenbergkapelle in Sölden zum Kohlernkopf

This is an obvious false positive; maybe the sharp part of the airplane matched the part of the roof.

From my observation, obvious false positive matches that do not share any object numbered fewer than 50, which is 0.08%. Wrong matches usually occurred when the image contained graphs or documents; when the image was a normal photograph, the results were very reliable.

Pastec analysis

Pastec works in the following order :

  1. Load the visual words: the visualWordsORB.dat file contains them; its size is 32,000,000 bytes, and loading it takes around 1 second.
  2. Build the word index: using the visual words, Pastec builds the word index, which takes around 13 seconds.
  3. Now a previously saved index file can be loaded, or images can be added to the index.
  4. Given an image file, images that contain similar word indexes can be searched.
  5. The index in memory can be written to a file.

Adding a new image to the index works in the following order (step 1 is sketched after this list):

  1. ORB features are extracted using OpenCV.
  2. Matching visual words are searched.
  3. The matching visual words are indexed in memory.
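Step 1 might look like the following (a sketch assuming OpenCV 3.x; Pastec's actual parameters and OpenCV version may differ):

// Sketch: extract ORB keypoints and binary descriptors from an image.
#include <cstdio>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>

int main(int argc, char* argv[])
{
    if (argc < 2) return 1;
    cv::Mat img = cv::imread(argv[1], cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;                 // e.g. gif/svg are unsupported

    cv::Ptr<cv::ORB> orb = cv::ORB::create(); // defaults to up to 500 features
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                       // one 32-byte binary row per keypoint
    orb->detectAndCompute(img, cv::noArray(), keypoints, descriptors);

    printf("%zu keypoints\n", keypoints.size());
    return 0;
}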

When I added 900 images, the size of index file was 16,967,440 bytes.

By changing the source code, I saved the matching visual word list to a text file for each image. Each word match is stored using this struct :

struct HitForward
{
    u_int32_t i_wordId;   // visual word id
    u_int32_t i_imageId;  // image id
    u_int16_t i_angle;    // keypoint angle
    u_int16_t x;          // x coordinate
    u_int16_t y;          // y coordinate
};

Each word match has a word id, an image id, an angle, and x/y coordinates. The saved file looks like this (in the order ImageID,Angle,x,y,WordId) :

469,55772,417,111,99042
469,46096,424,453,261282
469,4246,866,265,40072
469,44288,855,295,635378
469,59150,735,268,28827
469,12526,529,112,139341
469,12513,500,39,172187
469,48546,615,59,288827

It contains 1593 lines, which means the image has 1593 matching words. Image id 469 was Jánské.jpg, a 12.8 MB HDR photograph. Like other HDR images, it contains lots of features, and it has the biggest number of matching words among the 900 images. When the data was written to the text file, the size was 39,173 bytes, which would be the worst case; when an image is simple, only a few words match. The full size of the matching word text files for all 900 images was 20.9 MB.

To reduce this, I made a simple binary format. Since the image id is the same for every word of an image, it is written once, followed by a 4-byte count. Then every word is written as a 4-byte word id, a 2-byte angle, a 2-byte x, and a 2-byte y.

4 bytes - id
4 bytes - count
4,2,2,2 (10 bytes) *  count
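A sketch of a writer for this layout (writeWords() is a hypothetical helper name of mine; HitForward is the struct shown above):

// Sketch: serialize one image's word matches into the compact binary format.
#include <cstdio>
#include <vector>
#include <sys/types.h>

struct HitForward
{
    u_int32_t i_wordId;
    u_int32_t i_imageId;
    u_int16_t i_angle;
    u_int16_t x;
    u_int16_t y;
};

void writeWords(FILE* fp, u_int32_t imageId, const std::vector<HitForward>& hits)
{
    u_int32_t count = hits.size();
    fwrite(&imageId, 4, 1, fp);          // 4 bytes - id (written once)
    fwrite(&count, 4, 1, fp);            // 4 bytes - count
    for (const HitForward& h : hits) {   // 10 bytes per matching word
        fwrite(&h.i_wordId, 4, 1, fp);
        fwrite(&h.i_angle, 2, 1, fp);
        fwrite(&h.x, 2, 1, fp);
        fwrite(&h.y, 2, 1, fp);
    }
}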

In the case of the id 469 image, the size is 15,938 bytes (8 bytes of header plus 1593 × 10 bytes of word data). The file looks like this :

00000000: d501 0000 3906 0000 e282 0100 dcd9 a101  ....9...........
00000010: 6f00 a2fc 0300 10b4 a801 c501 889c 0000  o...............
00000020: 9610 6203 0901 f2b1 0900 00ad 5703 2701  ..b.........W.'.
00000030: 9b70 0000 0ee7 df02 0c01 4d20 0200 ee30  .p........M ...0
00000040: 1102 7000 9ba0 0200 e130 f401 2700 3b68  ..p......0..'.;h
00000050: 0400 a2bd 6702 3b00 b094 0800 c64c 5f02  ....g.;......L_.

0x1d5 is 469 and 0x639 is 1593.
That is about 15 KB, around 40% of the text format (39 KB).
Since this image is the worst case, storing this binary word list in the database for every image record is realistic.
The full size for all 900 images was 8.5 MB (the text files were 20.9 MB).
Interestingly, this is smaller than the index file for the 900 images (16.2 MB).

Conclusion

I was thinking of saving the index file. However, saving the word list for each image is the better solution: in binary format it consumes less storage, and adding it to the index is very fast. Also, when it is stored as a database field, synchronization between the index and the database is no longer a problem.

How to import CMake project in Eclipse CDT4

Currently I am analysing Pastec, which uses CMake as its build system. To dig into the code, I wanted to analyse it using the functionality of Eclipse.

Pastec can be built with the following commands.

$ git clone https://github.com/Visu4link/pastec.git
$ mkdir build
$ cd build
$ cmake ../
$ make

To build Pastec in Eclipse CDT, instead of running “cmake ..”, the following needs to be done (for a Debug build).

$ cd build
$ cmake -G"Eclipse CDT4 - Unix Makefiles" -D CMAKE_BUILD_TYPE=Debug ..

Then, it can be imported into Eclipse:

  1. Import the project using the menu File->Import.
  2. Select General->Existing projects into workspace.
  3. Browse to where your build tree is and select the root build tree directory (pastec/build). Keep “Copy projects into workspace” unchecked.
  4. You get a fully functional Eclipse project.

Reference

http://pastec.io/doc#setup
http://www.cmake.org/Wiki/Eclipse_CDT4_Generator