DCT HASH MATCHING QUALITY FOR RESIZED IMAGES 2

pHash performs its mathematical operations on every pixel of the original image. Therefore, when an image is resized, the resulting hash differs slightly depending on the image size. My assumption is that if every image larger than a certain size is first resized down to that size, overall matching quality will be better.

I tested the same set of image samples as in the previous posting; however, for speed, the comparison was performed on 3,644 images.

To find a good normalization size, I resized images to widths of 2000, 1500, and 1000 pixels, and then measured the Hamming distance between each normalized image and its copies scaled from 90% down to 10%. In each table below, the left half is the image count and the right half the percentage, with columns for original widths above 5000, 4000~5000, 3000~4000, 2000~3000, 1000~2000, and below 1000 pixels.

 

Images whose Hamming distance is greater than 4 (count and percentage)

normalization size 2000

  (count)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000      (%)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100% 0 0 0 0 0 1   100% 0.00 0.00 0.00 0.00 0.00 0.01
90% 0 0 0 0 6 17   90% 0.00 0.00 0.00 0.00 0.08 0.23
80% 0 0 0 1 12 19   80% 0.00 0.00 0.00 0.01 0.16 0.25
70% 0 0 0 1 18 36   70% 0.00 0.00 0.00 0.01 0.24 0.48
60% 0 0 0 12 48 87   60% 0.00 0.00 0.00 0.16 0.64 1.16
50% 0 0 3 26 77 141   50% 0.00 0.00 0.04 0.35 1.03 1.89
40% 0 0 9 62 172 272   40% 0.00 0.00 0.12 0.83 2.30 3.64
30% 1 12 54 156 333 475   30% 0.01 0.16 0.72 2.09 4.45 6.35
20% 27 99 246 424 693 851   20% 0.36 1.32 3.29 5.67 9.27 11.38
10% 163 360 753 1093 1442 1636   10% 2.18 4.82 10.07 14.62 19.29 21.89

normalization size 1500

  (count)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000      (%)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100% 0 0 0 0 0 1   100% 0.00 0.00 0.00 0.00 0.00 0.01
90% 0 0 0 0 2 13   90% 0.00 0.00 0.00 0.00 0.03 0.17
80% 0 0 0 0 7 14   80% 0.00 0.00 0.00 0.00 0.09 0.19
70% 0 0 0 1 15 33   70% 0.00 0.00 0.00 0.01 0.20 0.44
60% 0 0 0 2 25 64   60% 0.00 0.00 0.00 0.03 0.33 0.86
50% 0 0 0 7 46 110   50% 0.00 0.00 0.00 0.09 0.62 1.47
40% 0 0 4 25 123 223   40% 0.00 0.00 0.05 0.33 1.65 2.98
30% 0 0 18 86 247 389   30% 0.00 0.00 0.24 1.15 3.30 5.20
20% 6 27 116 257 520 678   20% 0.08 0.36 1.55 3.44 6.96 9.07
10% 137 308 654 969 1313 1507   10% 1.83 4.12 8.75 12.96 17.57 20.16

normalization size 1000

  (count)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000      (%)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100% 0 0 0 0 0 1   100% 0.00 0.00 0.00 0.00 0.00 0.01
90% 0 0 0 0 0 11   90% 0.00 0.00 0.00 0.00 0.00 0.15
80% 0 0 0 0 0 7   80% 0.00 0.00 0.00 0.00 0.00 0.09
70% 0 0 0 0 5 23   70% 0.00 0.00 0.00 0.00 0.07 0.31
60% 0 0 0 0 6 45   60% 0.00 0.00 0.00 0.00 0.08 0.60
50% 0 0 0 0 26 90   50% 0.00 0.00 0.00 0.00 0.35 1.20
40% 0 0 0 3 56 156   40% 0.00 0.00 0.00 0.04 0.75 2.09
30% 0 0 2 17 132 274   30% 0.00 0.00 0.03 0.23 1.77 3.67
20% 0 4 39 122 354 512   20% 0.00 0.05 0.52 1.63 4.74 6.85
10% 61 161 406 679 999 1193   10% 0.82 2.15 5.43 9.08 13.36 15.96

Images whose Hamming distance is greater than 6 (count and percentage)

normalization size 2000

  (count)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000      (%)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100% 0 0 0 0 0 0   100% 0.00 0.00 0.00 0.00 0.00 0.00
90% 0 0 0 0 0 1   90% 0.00 0.00 0.00 0.00 0.00 0.01
80% 0 0 0 1 2 3   80% 0.00 0.00 0.00 0.01 0.03 0.04
70% 0 0 0 0 4 11   70% 0.00 0.00 0.00 0.00 0.05 0.15
60% 0 0 0 0 8 21   60% 0.00 0.00 0.00 0.00 0.11 0.28
50% 0 0 0 6 20 46   50% 0.00 0.00 0.00 0.08 0.27 0.62
40% 0 0 4 21 46 94   40% 0.00 0.00 0.05 0.28 0.62 1.26
30% 0 0 11 45 106 175   30% 0.00 0.00 0.15 0.60 1.42 2.34
20% 4 14 63 142 286 381   20% 0.05 0.19 0.84 1.90 3.83 5.10
10% 59 153 347 539 752 869   10% 0.79 2.05 4.64 7.21 10.06 11.63

normalization size 1500

  (count)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000      (%)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100% 0 0 0 0 0 0   100% 0.00 0.00 0.00 0.00 0.00 0.00
90% 0 0 0 0 0 1   90% 0.00 0.00 0.00 0.00 0.00 0.01
80% 0 0 0 0 0 1   80% 0.00 0.00 0.00 0.00 0.00 0.01
70% 0 0 0 0 2 9   70% 0.00 0.00 0.00 0.00 0.03 0.12
60% 0 0 0 0 6 19   60% 0.00 0.00 0.00 0.00 0.08 0.25
50% 0 0 0 1 10 36   50% 0.00 0.00 0.00 0.01 0.13 0.48
40% 0 0 0 8 28 76   40% 0.00 0.00 0.00 0.11 0.37 1.02
30% 0 0 3 26 81 150   30% 0.00 0.00 0.04 0.35 1.08 2.01
20% 1 4 30 88 221 316   20% 0.01 0.05 0.40 1.18 2.96 4.23
10% 39 99 257 433 639 756   10% 0.52 1.32 3.44 5.79 8.55 10.11

normalization size 1000

  (count)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000      (%)  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100% 0 0 0 0 0 0   100% 0.00 0.00 0.00 0.00 0.00 0.00
90% 0 0 0 0 0 1   90% 0.00 0.00 0.00 0.00 0.00 0.01
80% 0 0 0 0 0 1   80% 0.00 0.00 0.00 0.00 0.00 0.01
70% 0 0 0 0 1 8   70% 0.00 0.00 0.00 0.00 0.01 0.11
60% 0 0 0 0 2 15   60% 0.00 0.00 0.00 0.00 0.03 0.20
50% 0 0 0 0 9 35   50% 0.00 0.00 0.00 0.00 0.12 0.47
40% 0 0 0 0 14 62   40% 0.00 0.00 0.00 0.00 0.19 0.83
30% 0 0 0 4 35 104   30% 0.00 0.00 0.00 0.05 0.47 1.39
20% 0 0 11 39 138 233   20% 0.00 0.00 0.15 0.52 1.85 3.12
10% 14 38 135 270 449 566   10% 0.19 0.51 1.81 3.61 6.01 7.57

 

 

Conclusion

According to the test results, resizing before hashing gives a better matching percentage, so normalization can be a solution for better matching. However, the false-positive matching percentage is also important and still has to be checked.

DCT Hash matching quality for resized images

The DCT hash in pHash was selected as the image-similarity search algorithm for Creative Commons image license search. Recently, we found that some images fail to match after being resized, so I tested this with Flickr CC images.

First, I resized each image to 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% of its original size. Each resized image was hashed, and the Hamming distance from the 100% version was calculated. Since image size matters, I categorized images by original width: bigger than 5000 pixels, 4000~5000, 3000~4000, 2000~3000, 1000~2000, and smaller than 1000 pixels.

The total image count was 7,475.
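The measurement for each pair is the bitwise Hamming distance between the two 64-bit DCT hashes. A minimal sketch of the computation (pHash provides ph_hamming_distance for this; the popcount below is the same idea):

```cpp
#include <cstdint>

// Hamming distance between two 64-bit DCT hashes: XOR the hashes,
// then count the bits that remain set (the differing bit positions).
int hamming_distance(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    int dist = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++dist;
    }
    return dist;
}
```

A resized copy counts toward the tables below when this distance exceeds the chosen threshold (4 or 6).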

Image count where the Hamming distance is greater than 4

  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 5 21 31 39 53 71
80% 8 28 46 58 80 97
70% 13 32 60 85 128 172
60% 23 71 123 170 244 322
50% 30 97 173 246 359 491
40% 65 182 344 490 712 908
30% 125 339 626 861 1217 1519
20% 236 577 1079 1472 2012 2349
10% 505 1080 1983 2698 3419 3823

Percentage of images where the Hamming distance is greater than 4

  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0.07 0.28 0.41 0.52 0.71 0.95
80% 0.11 0.37 0.62 0.78 1.07 1.30
70% 0.17 0.43 0.80 1.14 1.71 2.30
60% 0.31 0.95 1.65 2.27 3.26 4.31
50% 0.40 1.30 2.31 3.29 4.80 6.57
40% 0.87 2.43 4.60 6.56 9.53 12.15
30% 1.67 4.54 8.37 11.52 16.28 20.32
20% 3.16 7.72 14.43 19.69 26.92 31.42
10% 6.76 14.45 26.53 36.09 45.74 51.14

Image count where the Hamming distance is greater than 6

  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0 1 1 1 2 4
80% 2 3 7 10 12 17
70% 4 6 14 20 27 38
60% 6 13 23 35 50 76
50% 11 22 42 59 83 129
40% 19 58 99 140 207 297
30% 27 102 195 286 425 577
20% 79 227 441 612 896 1091
10% 249 579 1064 1475 1907 2159

Percentage of images where the Hamming distance is greater than 6

  >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0.00 0.01 0.01 0.01 0.03 0.05
80% 0.03 0.04 0.09 0.13 0.16 0.23
70% 0.05 0.08 0.19 0.27 0.36 0.51
60% 0.08 0.17 0.31 0.47 0.67 1.02
50% 0.15 0.29 0.56 0.79 1.11 1.73
40% 0.25 0.78 1.32 1.87 2.77 3.97
30% 0.36 1.36 2.61 3.83 5.69 7.72
20% 1.06 3.04 5.90 8.19 11.99 14.60
10% 3.33 7.75 14.23 19.73 25.51 28.88

 
 

Conclusion

The results show that when an image is resized, some copies can no longer be detected. A possible solution is to resize any image larger than a certain size down to that size before hashing. I tested normalization widths of 2000, 1500, and 1000 pixels.

64-bit unsigned long long transfer between JavaScript and the C++ daemon

Currently, the APIs that add and match image licenses receive a pHash value extracted from the image. This hash value is 64-bit binary. For fast processing, the database and the C++ daemon treat it as an unsigned long long. However, a problem came up recently while Anna was developing the JavaScript pHash module: when the JavaScript calculation printed the output hash value, the last 4 or 5 characters were wrong. That was because the largest integer JavaScript can represent exactly is 2^53.

  • Largest exactly representable integer in JavaScript:
    2^53 : 9007199254740992 : 0x20000000000000

  • Maximum value of unsigned long long:
    2^64 − 1 : 18446744073709551615 : 0xFFFFFFFFFFFFFFFF
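The corruption can be reproduced outside JavaScript too: a JavaScript number is an IEEE-754 double with a 53-bit mantissa, so round-tripping a 64-bit hash through a double drops the low bits. A small C++ model of that round trip (the sample hash value below is arbitrary, not a real pHash):

```cpp
#include <cstdint>

// Model the JavaScript number type: store a 64-bit hash in a double
// and read it back. Values above 2^53 lose low bits to mantissa rounding.
uint64_t through_double(uint64_t v) {
    double d = static_cast<double>(v);  // what a JS number stores
    return static_cast<uint64_t>(d);    // what comes back out
}
```

Anything up to 2^53 survives intact, while a value such as 0x1234567890ABCDEF comes back rounded to 0x1234567890ABCE00, which is exactly the "wrong last characters" symptom.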

There are two solutions:

  1. Using Big integer library like http://silentmatt.com/biginteger/
  2. Using Hexadecimal String for output

The first solution’s benefit is that other modules do not have to change. The second solution’s benefit is that it needs no additional JavaScript library.
We decided to use solution 2, because

  1. the hash value is only used to be sent to the PHP API page
  2. no calculation on it is needed
  3. later, when another hash algorithm is used, the value can be much longer
  4. an additional JavaScript library would make client implementations slower

After adopting this solution, the following modules were affected.

  • javascript : added code to change from binary string to hexadecimal string

  • phash : hash generator from image
    I changed the code from generating integer string to generating hexadecimal string.

//printf("%llu\n", tmphash);
printf("%016llX\n", tmphash);
  • hamming : hamming distance calculator from two hash values
    I changed it to get hexadecimal string :
//    ulong64 hash1 = strtoull(argv[1], NULL, 10);
//    ulong64 hash2 = strtoull(argv[2], NULL, 10);
    ulong64 hash1 = strtoull(argv[1], NULL, 16);
    ulong64 hash2 = strtoull(argv[2], NULL, 16);
  • regdaemon : C++ daemon
    I changed add/match command so it gets hexadecimal string.
//uint64_t uiHash = std::stoull(strHash);
uint64_t uiHash = std::stoull(strHash, 0, 16);
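Putting the producer and the consumers together, the value survives a round trip through the 16-character hexadecimal string. A small check of the two calls used above (the sample value is arbitrary):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Format a hash the way the patched phash tool prints it:
// 16 zero-padded uppercase hex digits.
std::string hash_to_hex(uint64_t hash) {
    char buf[17];
    snprintf(buf, sizeof(buf), "%016llX", (unsigned long long)hash);
    return std::string(buf);
}

// Parse it back the way hamming/regdaemon now read it: base 16.
uint64_t hex_to_hash(const std::string &s) {
    return strtoull(s.c_str(), nullptr, 16);
}
```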

The PHP API doesn’t have to change because it just passes the value through, base64 encoded.

For the MySQL database field, we decided to keep the 64-bit unsigned integer type for the DCT hash value, because then the value does not need to be converted from a string to a number when loading into memory for indexing.

libstdc++.so.6 library mismatch problem and solution

Problem

When I tried to run an executable that had been built on another machine, it showed the following error:

$ ./regdaemon
./regdaemon: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./regdaemon)

 

Analysis

The reason for this error was that the dynamic-link library libstdc++.so.6 on this machine was older than the version used on the build machine.

On the build machine, the library looks like this:

/usr/lib/x86_64-linux-gnu$ ll libstdc*
lrwxrwxrwx 1 root root      19 Nov  4  2014 libstdc++.so.6 -> libstdc++.so.6.0.20
-rw-r--r-- 1 root root 1011824 Nov  4  2014 libstdc++.so.6.0.20

This means the library actually used by the executable is libstdc++.so.6.0.20, with libstdc++.so.6 as a symlink to it. This library was installed along with a newer gcc.

On the machine that showed the error, the library looked like this:

/usr/lib/x86_64-linux-gnu $ ll libstdc*
lrwxrwxrwx 1 root root     19 May 14 14:11 libstdc++.so.6 -> libstdc++.so.6.0.19
-rw-r--r-- 1 root root 979056 May 14 14:41 libstdc++.so.6.0.19

libstdc++.so.6 links to libstdc++.so.6.0.19, which is older than the version on the build machine.

 

Solution

Since the machine was running Linux Mint, which is Ubuntu-based, a newer gcc can be installed with the following commands:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.9

Then the library is updated like this :

/usr/lib/x86_64-linux-gnu $ ll libstdc*
lrwxrwxrwx 1 root root      19 Apr 23 13:00 libstdc++.so.6 -> libstdc++.so.6.0.21
-rw-r--r-- 1 root root 1541600 Apr 23 13:23 libstdc++.so.6.0.21

Now, because the installed library was newer than the one on the build machine, the executable ran fine.

Another solution would be to link the C++ runtime statically by adding the -static-libstdc++ (and -static-libgcc) options when building.

Additional information

Which files (regular files, sockets, etc.) a process has open can be seen with the lsof utility.

hosung@hosung-Spectre:~$ lsof -p 6002
COMMAND    PID   USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
regdaemon 6002 hosung  cwd    DIR                8,2     4096 2589221 /home/hosung/cdot/ccl/regdaemon/Debug
regdaemon 6002 hosung  rtd    DIR                8,2     4096       2 /
regdaemon 6002 hosung  txt    REG                8,2  1066943 2545008 /home/hosung/cdot/ccl/regdaemon/Debug/regdaemon
regdaemon 6002 hosung  mem    REG                8,2    47712 2117917 /lib/x86_64-linux-gnu/libnss_files-2.19.so
regdaemon 6002 hosung  mem    REG                8,2    14664 2117927 /lib/x86_64-linux-gnu/libdl-2.19.so
regdaemon 6002 hosung  mem    REG                8,2   100728 2101352 /lib/x86_64-linux-gnu/libz.so.1.2.8
regdaemon 6002 hosung  mem    REG                8,2  1071552 2117915 /lib/x86_64-linux-gnu/libm-2.19.so
regdaemon 6002 hosung  mem    REG                8,2  3355040 6921479 /usr/lib/x86_64-linux-gnu/libmysqlclient.so.18.0.0
regdaemon 6002 hosung  mem    REG                8,2  1840928 2117938 /lib/x86_64-linux-gnu/libc-2.19.so
regdaemon 6002 hosung  mem    REG                8,2    92504 2097171 /lib/x86_64-linux-gnu/libgcc_s.so.1
regdaemon 6002 hosung  mem    REG                8,2  1011824 6846284 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
regdaemon 6002 hosung  mem    REG                8,2  1112840 6830431 /usr/lib/libmysqlcppconn.so.7.1.1.3
regdaemon 6002 hosung  mem    REG                8,2   141574 2117939 /lib/x86_64-linux-gnu/libpthread-2.19.so
regdaemon 6002 hosung  mem    REG                8,2   149120 2117935 /lib/x86_64-linux-gnu/ld-2.19.so
regdaemon 6002 hosung    0u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    1u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    2u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    3u  IPv4              63342      0t0     TCP localhost:60563->localhost:mysql (ESTABLISHED)
regdaemon 6002 hosung    4u  unix 0x0000000000000000      0t0  114861 /tmp/cc.daemon.sock

CC Image License Search Engine API Implementation

CC - New Page (architecture diagram)

UI

Previously, my colleague Anna made a page that searches for similar images via upload or a link. This UI page can live either on the server or outside it. It uses only the PHP API, without accessing the database directly.
 

PHP API

This is an open API providing add, delete, and match functions for images; it can be accessed by anyone who wants these functions. The UI page, or a client implementation such as a browser extension, uses this API. The matching result is in JSON format.
The API page performs add/delete/match by asking the C++ daemon rather than changing the database itself.
Only read-only access to the database is permitted for the API.
 

C++ Daemon

All add and delete operations will be done in this daemon. Doing so removes the synchronization problem between the database and the matching index, because the daemon keeps the content index in memory at all times for fast matching.
Because the daemon runs continuously, it works as a domain socket server to receive requests from and return results to the PHP API; the PHP API sends its requests over the domain socket.
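The request/response exchange between the PHP API and the daemon can be sketched with a connected pair of Unix-domain sockets. Here socketpair() stands in for the daemon’s real listening socket, and the MATCH command text is made up for illustration:

```cpp
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// One end plays the PHP client, the other the daemon; the daemon here
// just sends back a canned reply. Returns what the client reads.
std::string round_trip(const std::string &request, const std::string &reply) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return "";
    write(fds[0], request.data(), request.size());  // client -> daemon
    char buf[256];
    ssize_t n = read(fds[1], buf, sizeof(buf));     // daemon reads request
    if (n <= 0) return "";
    write(fds[1], reply.data(), reply.size());      // daemon -> client
    n = read(fds[0], buf, sizeof(buf));             // client reads reply
    close(fds[0]);
    close(fds[1]);
    return std::string(buf, n > 0 ? (size_t)n : 0);
}
```

The real daemon binds and listens on a filesystem path (such as the /tmp/cc.daemon.sock seen in the lsof output above) instead of using a socketpair.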
 

MySQL

The database contains all the metadata about the CC-licensed images, plus the thumbnail paths used to show previews in the matching results.

Pastec Test for Performance

So far, I have tested Pastec in terms of image-matching quality. In this posting, I test the speed of adding and searching.

Adding images to index

First I added 100 images, which took 48.339 seconds. Then I added all directories from 22 to 31; those images were uploaded to Wikimedia Commons from 2013.12.22 to 2013.12.31.

Directory Start End Duration Count Average
22 17:32:42 18:43:50 01:11:08 8785 00:00.49
23 18:43:50 19:42:03 00:58:13 7314 00:00.48
24 19:42:03 20:28:56 00:46:53 6001 00:00.47
25 20:28:57 21:28:02 00:59:05 7783 00:00.46
26 21:28:02 22:41:12 01:13:10 9300 00:00.47
27 22:41:19 23:54:28 01:13:09 9699 00:00.45
28 00:54:28 01:53:23 00:58:55 7912 00:00.45
29 00:53:23 02:27:42 01:34:19 11839 00:00.48
30 02:27:42 03:31:48 01:04:06 8827 00:00.44
31 03:31:48 04:23:15 00:51:27 6880 00:00.45

The average time for adding an image was around 0.46 second, and it didn’t increase as the index grew. Most of the time for adding an image is spent extracting features.
I saved the index files for 100 images, for directories 22 to 26, and for 22 to 31. Their sizes were 8.7 MB, 444.1 MB, and 935.8 MB respectively.
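The per-image averages above are just duration divided by image count; for example, directory 22 took 1:11:08 for 8,785 images, about 0.49 s each. A one-line check of that arithmetic:

```cpp
// Average seconds per image: duration (h:m:s) divided by image count.
double avg_seconds(int h, int m, int s, int count) {
    return double(h * 3600 + m * 60 + s) / count;
}
```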

 

Searching images

I loaded the index file for the 100 images and searched for all 100 images that had been added.

Directory Duration Count Average
22 00:01:14 100 00:00.74

Searching took 1m14.781s. Since there were 100 images, the average time to search one image was 0.74 second.

Then I loaded the index file containing 39,183 images from directories 22 to 26.

Directory Start End Duration Count Average
22 09:00:05 11:21:06 02:21:01 8785 00:00.96
23 11:21:06 13:13:52 01:52:46 7314 00:00.93
24 13:13:52 14:48:26 01:34:34 6001 00:00.95
25 14:48:26 16:48:44 02:00:18 7783 00:00.93
26 16:48:44 19:13:11 02:24:27 9300 00:00.93

This time, average time for searching one image was 0.95 second.

Then I loaded the index file containing 84,340 images from directories 22 to 31.

Directory Start End Duration Count Average
22 19:32:54 22:44:09 03:11:15 8785 00:01.31
23 20:44:09 23:16:59 02:32:50 7314 00:01.25
24 01:16:59 03:24:52 02:07:53 6001 00:01.28
25 03:24:52 06:11:33 02:46:41 7783 00:01.28
26 06:11:33 09:30:53 03:19:20 9300 00:01.29

Searching was performed on the same images from directories 22 to 26. The average search time was 1.3 seconds.

Conclusion

  • Adding an image took about 0.47 second.
  • Adding time did not vary with index size.
  • Search time varied with index size.
  • When the index held 100, 39,183, and 84,340 images, search time was 0.74, 0.95, and 1.3 seconds, respectively.
    Screenshot from 2015-06-28 23:14:15
    In the chart, the y-axis is time in milliseconds. Around 0.6 second is likely spent reading the image and extracting features; search time itself then grows in proportion to the index size.

scp/sftp through an ssh tunnel

SSH Tunneling

A machine called CC can only be connected to from another machine called zenit.
To scp to CC through zenit, the following command establishes an ssh tunnel to CC.

ssh -L 9999:[address of CC known to zenit]:22 [user at zenit]@[address of zenit]
in my case,
ssh -L 9999:10.10.10.123:22 user1234567@zenit.server.ca

Now port 9999 on localhost (127.0.0.1) tunnels to CC through zenit.
This session needs to stay alive for everything that follows.

 

SCP through the SSH Tunnel

Then these commands scp the local test.png file to CC:~/tmp, and copy CC:~/tmp/test.png back to the current directory:

scp -P 9999 test.png ccuser@127.0.0.1:~/tmp
scp -P 9999 ccuser@127.0.0.1:~/tmp/test.png .

 

Making it easy

Typing those long commands every time is not ideal, so I added an alias to .bashrc.

alias ccturnnel='ssh -L 9999:10.10.10.123:22 user1234567@zenit.server.ca'

Then I wrote two simple bash scripts.

This is cpfromcc.

#!/bin/bash
# Copy a file from CC: map the local home path back to ~ so the
# remote side expands it to the remote home directory.
var=$(echo $1 | sed 's/\/home\/hosung/~/g')
remote=$var
scp -P 9999 ccuser@127.0.0.1:$remote $2

This is cptocc.

#!/bin/bash
# Copy files to CC: every argument except the last is a local source,
# and the last argument is the remote destination path.
values=""
remote=""
i=1
for var in "$@"
do
    if [ $i -ne $# ]
    then
        values="$values $var"
    else
        # Map the local home path back to ~ for the remote side.
        var=$(echo $var | sed 's/\/home\/hosung/~/g')
        remote=$var
    fi
    ((i++))
done
scp -P 9999 $values ccuser@127.0.0.1:$remote

The reason I use sed on the remote path is that the local bash expands ~ to my local home directory before the script sees it; mapping it back to ~ lets the remote side expand it to the remote home.
Now I can establish the ssh tunnel by typing ccturnnel.
Then I can do scp from my machine to CC using:

cptocc test.jpg test2.jpg ~

And I can do scp from CC to my machine using :

cpfromcc ~/remotefile.txt .

 

Making it convenient using sftp

When the tunnel is established, sftp works the same way; note that sftp takes the port as an option rather than after a colon:

$ sftp -P 9999 ccuser@127.0.0.1

 

Making it more convenient using Krusader

By typing sftp://ccuser@127.0.0.1:9999 in the URL bar of Krusader, and then adding the place as a bookmark, the remote machine’s file system is easily accessed.

Screenshot from 2015-06-26 10:23:39

Mounting it with sshfs would also be possible.