DCT HASH MATCHING QUALITY FOR RESIZED IMAGES 2

pHash performs its mathematical operations on every pixel at the original image size. Therefore, when the image is resized, the result differs slightly depending on the image size. My assumption was that if every image bigger than a certain size were resized down to that size before hashing, the general matching quality would be better.

I tested with the same set of image samples as in the previous posting; however, because of the processing speed, the comparison was performed on 3,644 images.

To find which size is good for normalization, I resized images to widths of 2000, 1500, and 1000 pixels, then computed the hamming distance between each normalized image and its versions scaled from 90% down to 10%.
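As a rough illustration of the normalization step, the sketch below caps an image at 2000 pixels wide before hashing. It assumes pHash's ph_dct_imagehash() is available and shells out to ImageMagick's convert; the threshold, temporary path, and shell-out approach are placeholders of mine, not the exact code used for this test.

// Sketch: cap the width at 2000px, then compute the 64-bit DCT hash.
// Assumes pHash (ph_dct_imagehash) and ImageMagick (convert) are installed.
#include <cstdio>
#include <cstdlib>
#include <pHash.h>

int main(int argc, char *argv[])
{
    if (argc < 2) { fprintf(stderr, "usage: %s image\n", argv[0]); return 1; }

    // ImageMagick's '>' geometry flag shrinks the image only when it is
    // wider than 2000px; smaller images are left untouched.
    char cmd[1024];
    snprintf(cmd, sizeof(cmd),
             "convert '%s' -resize '2000>' /tmp/normalized.jpg", argv[1]);
    if (system(cmd) != 0) return 1;

    ulong64 hash = 0;
    if (ph_dct_imagehash("/tmp/normalized.jpg", hash) < 0) return 1;
    printf("%016llX\n", (unsigned long long)hash);
    return 0;
}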

 

Hamming Distance is bigger than 4

normalization size 2000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          1
90%     0      0          0          0          6          17
80%     0      0          0          1          12         19
70%     0      0          0          1          18         36
60%     0      0          0          12         48         87
50%     0      0          3          26         77         141
40%     0      0          9          62         172        272
30%     1      12         54         156        333        475
20%     27     99         246        424        693        851
10%     163    360        753        1093       1442       1636

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.01
90%     0.00   0.00       0.00       0.00       0.08       0.23
80%     0.00   0.00       0.00       0.01       0.16       0.25
70%     0.00   0.00       0.00       0.01       0.24       0.48
60%     0.00   0.00       0.00       0.16       0.64       1.16
50%     0.00   0.00       0.04       0.35       1.03       1.89
40%     0.00   0.00       0.12       0.83       2.30       3.64
30%     0.01   0.16       0.72       2.09       4.45       6.35
20%     0.36   1.32       3.29       5.67       9.27       11.38
10%     2.18   4.82       10.07      14.62      19.29      21.89

normalization size 1500

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          1
90%     0      0          0          0          2          13
80%     0      0          0          0          7          14
70%     0      0          0          1          15         33
60%     0      0          0          2          25         64
50%     0      0          0          7          46         110
40%     0      0          4          25         123        223
30%     0      0          18         86         247        389
20%     6      27         116        257        520        678
10%     137    308        654        969        1313       1507

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.01
90%     0.00   0.00       0.00       0.00       0.03       0.17
80%     0.00   0.00       0.00       0.00       0.09       0.19
70%     0.00   0.00       0.00       0.01       0.20       0.44
60%     0.00   0.00       0.00       0.03       0.33       0.86
50%     0.00   0.00       0.00       0.09       0.62       1.47
40%     0.00   0.00       0.05       0.33       1.65       2.98
30%     0.00   0.00       0.24       1.15       3.30       5.20
20%     0.08   0.36       1.55       3.44       6.96       9.07
10%     1.83   4.12       8.75       12.96      17.57      20.16

normalization size 1000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          1
90%     0      0          0          0          0          11
80%     0      0          0          0          0          7
70%     0      0          0          0          5          23
60%     0      0          0          0          6          45
50%     0      0          0          0          26         90
40%     0      0          0          3          56         156
30%     0      0          2          17         132        274
20%     0      4          39         122        354        512
10%     61     161        406        679        999        1193

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.01
90%     0.00   0.00       0.00       0.00       0.00       0.15
80%     0.00   0.00       0.00       0.00       0.00       0.09
70%     0.00   0.00       0.00       0.00       0.07       0.31
60%     0.00   0.00       0.00       0.00       0.08       0.60
50%     0.00   0.00       0.00       0.00       0.35       1.20
40%     0.00   0.00       0.00       0.04       0.75       2.09
30%     0.00   0.00       0.03       0.23       1.77       3.67
20%     0.00   0.05       0.52       1.63       4.74       6.85
10%     0.82   2.15       5.43       9.08       13.36      15.96

Hamming Distance is bigger than 6

normalization size 2000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          0
90%     0      0          0          0          0          1
80%     0      0          0          1          2          3
70%     0      0          0          0          4          11
60%     0      0          0          0          8          21
50%     0      0          0          6          20         46
40%     0      0          4          21         46         94
30%     0      0          11         45         106        175
20%     4      14         63         142        286        381
10%     59     153        347        539        752        869

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.00
90%     0.00   0.00       0.00       0.00       0.00       0.01
80%     0.00   0.00       0.00       0.01       0.03       0.04
70%     0.00   0.00       0.00       0.00       0.05       0.15
60%     0.00   0.00       0.00       0.00       0.11       0.28
50%     0.00   0.00       0.00       0.08       0.27       0.62
40%     0.00   0.00       0.05       0.28       0.62       1.26
30%     0.00   0.00       0.15       0.60       1.42       2.34
20%     0.05   0.19       0.84       1.90       3.83       5.10
10%     0.79   2.05       4.64       7.21       10.06      11.63

normalization size 1500

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          0
90%     0      0          0          0          0          1
80%     0      0          0          0          0          1
70%     0      0          0          0          2          9
60%     0      0          0          0          6          19
50%     0      0          0          1          10         36
40%     0      0          0          8          28         76
30%     0      0          3          26         81         150
20%     1      4          30         88         221        316
10%     39     99         257        433        639        756

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.00
90%     0.00   0.00       0.00       0.00       0.00       0.01
80%     0.00   0.00       0.00       0.00       0.00       0.01
70%     0.00   0.00       0.00       0.00       0.03       0.12
60%     0.00   0.00       0.00       0.00       0.08       0.25
50%     0.00   0.00       0.00       0.01       0.13       0.48
40%     0.00   0.00       0.00       0.11       0.37       1.02
30%     0.00   0.00       0.04       0.35       1.08       2.01
20%     0.01   0.05       0.40       1.18       2.96       4.23
10%     0.52   1.32       3.44       5.79       8.55       10.11

normalization size 1000

Count:

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0      0          0          0          0          0
90%     0      0          0          0          0          1
80%     0      0          0          0          0          1
70%     0      0          0          0          1          8
60%     0      0          0          0          2          15
50%     0      0          0          0          9          35
40%     0      0          0          0          14         62
30%     0      0          0          4          35         104
20%     0      0          11         39         138        233
10%     14     38         135        270        449        566

Percentage (%):

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
100%    0.00   0.00       0.00       0.00       0.00       0.00
90%     0.00   0.00       0.00       0.00       0.00       0.01
80%     0.00   0.00       0.00       0.00       0.00       0.01
70%     0.00   0.00       0.00       0.00       0.01       0.11
60%     0.00   0.00       0.00       0.00       0.03       0.20
50%     0.00   0.00       0.00       0.00       0.12       0.47
40%     0.00   0.00       0.00       0.00       0.19       0.83
30%     0.00   0.00       0.00       0.05       0.47       1.39
20%     0.00   0.00       0.15       0.52       1.85       3.12
10%     0.19   0.51       1.81       3.61       6.01       7.57

 

 

Conclusion

According to the test results, in terms of matching percentage, resizing before hashing gives better results, so it can be a solution for better matching. However, the false positive matching percentage is also important and still needs to be examined.

DCT Hash matching quality for resized images

The DCT hash in pHash was selected as the image similarity search algorithm for Creative Commons image license search. Recently, we found that some images are not matched after they are resized, so I tested this with flickr CC images.

First, I resized each image to 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% of its original size. Each resized image was hashed, and the hamming distance from the 100% image's hash was calculated. Since image size matters, I categorized the images by original width: bigger than 5000 pixels, 4000~5000, 3000~4000, 2000~3000, 1000~2000, and smaller than 1000 pixels.

The total image count was 7,475.
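For reference, each comparison behind the tables below boils down to something like this sketch (the file names are placeholders; ph_dct_imagehash() and ph_hamming_distance() are the pHash calls involved):

// Sketch: hash the 100% image and one resized copy, then compare.
#include <cstdio>
#include <pHash.h>

int main()
{
    ulong64 hashOriginal = 0, hashResized = 0;

    // Hash the original (100%) image and a resized copy of it.
    if (ph_dct_imagehash("original_100.jpg", hashOriginal) < 0) return 1;
    if (ph_dct_imagehash("resized_50.jpg", hashResized) < 0) return 1;

    // Number of differing bits between the two 64-bit DCT hashes;
    // the tables below count images where this exceeds 4 (or 6).
    int distance = ph_hamming_distance(hashOriginal, hashResized);
    printf("hamming distance = %d\n", distance);
    return 0;
}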

Count of images whose hamming distance is bigger than 4

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 5 21 31 39 53 71
80% 8 28 46 58 80 97
70% 13 32 60 85 128 172
60% 23 71 123 170 244 322
50% 30 97 173 246 359 491
40% 65 182 344 490 712 908
30% 125 339 626 861 1217 1519
20% 236 577 1079 1472 2012 2349
10% 505 1080 1983 2698 3419 3823

Percentage of images whose hamming distance is bigger than 4

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0.07 0.28 0.41 0.52 0.71 0.95
80% 0.11 0.37 0.62 0.78 1.07 1.30
70% 0.17 0.43 0.80 1.14 1.71 2.30
60% 0.31 0.95 1.65 2.27 3.26 4.31
50% 0.40 1.30 2.31 3.29 4.80 6.57
40% 0.87 2.43 4.60 6.56 9.53 12.15
30% 1.67 4.54 8.37 11.52 16.28 20.32
20% 3.16 7.72 14.43 19.69 26.92 31.42
10% 6.76 14.45 26.53 36.09 45.74 51.14

Count of images whose hamming distance is bigger than 6

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0 1 1 1 2 4
80% 2 3 7 10 12 17
70% 4 6 14 20 27 38
60% 6 13 23 35 50 76
50% 11 22 42 59 83 129
40% 19 58 99 140 207 297
30% 27 102 195 286 425 577
20% 79 227 441 612 896 1091
10% 249 579 1064 1475 1907 2159

Percentage of images whose hamming distance is bigger than 6

        >5000  4000~5000  3000~4000  2000~3000  1000~2000  <1000
90% 0.00 0.01 0.01 0.01 0.03 0.05
80% 0.03 0.04 0.09 0.13 0.16 0.23
70% 0.05 0.08 0.19 0.27 0.36 0.51
60% 0.08 0.17 0.31 0.47 0.67 1.02
50% 0.15 0.29 0.56 0.79 1.11 1.73
40% 0.25 0.78 1.32 1.87 2.77 3.97
30% 0.36 1.36 2.61 3.83 5.69 7.72
20% 1.06 3.04 5.90 8.19 11.99 14.60
10% 3.33 7.75 14.23 19.73 25.51 28.88

 
 

Conclusion

The results show that when an image is resized, some images can no longer be detected. A possible solution is to resize any image bigger than a certain size down to that size before hashing. I tested normalization widths of 2000, 1500, and 1000 pixels.

64bit unsigned long long type transfer between Javascript and C++ Daemon

Currently, the APIs that add and match image licenses receive a pHash value extracted from the image. This hash value is a 64-bit binary value; for fast processing, the database and the C++ daemon treat it as an unsigned long long. However, while Anna was recently developing the JavaScript pHash module, a problem appeared: when the JavaScript calculation printed the output hash value, the last 4 or 5 characters were wrong. That is because the maximum exactly-representable integer in JavaScript is 2^53.

  • Max value of integer in Javascript :
    2^53 : 9007199254740992 : 0x20000000000000

  • Max value of unsigned long long :
    2^64 - 1 : 18446744073709551615 : 0xFFFFFFFFFFFFFFFF
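The effect is easy to reproduce in C++ by converting a 64-bit hash to a double, which is the same IEEE-754 format JavaScript uses for numbers (a demonstration added here for illustration, not project code):

// Demonstration: above 2^53, adjacent integers collapse in a double,
// so the last digits of a 64-bit hash come out wrong in JavaScript.
#include <cstdio>
#include <cstdint>

int main()
{
    uint64_t hash = 0xFFFFFFFFFFFFFFFFULL; // a full 64-bit hash value
    double asJsNumber = (double)hash;      // what a JavaScript number stores

    printf("exact : %llu\n", (unsigned long long)hash); // 18446744073709551615
    printf("double: %.0f\n", asJsNumber);               // 18446744073709551616
    return 0;
}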

There are two solutions:

  1. Using Big integer library like http://silentmatt.com/biginteger/
  2. Using Hexadecimal String for output

The first solution has the benefit that other modules do not have to be changed. The second solution's benefit is that it does not need an additional Javascript library.
We decided to use solution 2, because:

  1. the hash value is only ever sent to the php API page
  2. no calculation needs to be done on it
  3. later, when another hash algorithm is used, the hash can be much longer
  4. an additional Javascript library would make client implementations slower.

After adopting this solution, the following modules were affected.

  • javascript : added code to change from binary string to hexadecimal string

  • phash : hash generator from image
    I changed the code from generating integer string to generating hexadecimal string.

//printf("%llu\n", tmphash);
printf("%016llX\n", tmphash);
  • hamming : hamming distance calculator from two hash values
    I changed it to get hexadecimal string :
//    ulong64 hash1 = strtoull(argv[1], NULL, 10);
//    ulong64 hash2 = strtoull(argv[2], NULL, 10);
    ulong64 hash1 = strtoull(argv[1], NULL, 16);
    ulong64 hash2 = strtoull(argv[2], NULL, 16);
  • regdaemon : C++ daemon
    I changed add/match command so it gets hexadecimal string.
//uint64_t uiHash = std::stoull(strHash);
uint64_t uiHash = std::stoull(strHash, 0, 16);
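Putting the pieces together, a quick round-trip check (my own sanity test, not project code) shows that the hexadecimal string preserves all 64 bits:

// Round trip: the hex string printed by the patched phash parses back
// to the identical 64-bit value in hamming/regdaemon, so nothing is lost.
#include <cstdio>
#include <cstdlib>

typedef unsigned long long ulong64;

int main()
{
    ulong64 original = 0xABCDEF0123456789ULL;        // a sample 64-bit hash

    char hex[17];
    snprintf(hex, sizeof(hex), "%016llX", original); // what phash now prints

    ulong64 parsed = strtoull(hex, NULL, 16);        // what hamming/regdaemon read
    printf("%s -> %llu (lossless: %s)\n", hex, parsed,
           parsed == original ? "yes" : "no");
    return 0;
}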

The php API does not have to be changed because it simply passes the value through, base64 encoded.

For the MySQL database field, we decided to keep the 64-bit unsigned integer type for the DCT hash value, because then the value does not have to be converted from a string to a number when it is loaded into memory for indexing.

libstdc++.so.6 library mismatch problem and solution

Problem

When I tried to run an executable that had been built on another machine, it showed the following error :

$ ./regdaemon
./regdaemon: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./regdaemon)

 

Analysis

The reason for this error was that the dynamically linked library libstdc++.so.6 was an older version than the one used on the build machine.

On the build machine, the library looks like this:

/usr/lib/x86_64-linux-gnu$ ll libstdc*
lrwxrwxrwx 1 root root      19 Nov  4  2014 libstdc++.so.6 -> libstdc++.so.6.0.20
-rw-r--r-- 1 root root 1011824 Nov  4  2014 libstdc++.so.6.0.20

This means that the library actually used by the executable is libstdc++.so.6.0.20, and libstdc++.so.6 links to it. This library is installed with a newer gcc.

On the machine that showed the error, the library looked like this:

/usr/lib/x86_64-linux-gnu $ ll libstdc*
lrwxrwxrwx 1 root root     19 May 14 14:11 libstdc++.so.6 -> libstdc++.so.6.0.19
-rw-r--r-- 1 root root 979056 May 14 14:41 libstdc++.so.6.0.19

libstdc++.so.6 links to libstdc++.so.6.0.19, which is an older version than the one on the build machine.

 

Solution

Since the machine was running Linux Mint, which is Ubuntu-based, a newer gcc can be installed with the following commands :

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.9

Then the library is updated like this :

/usr/lib/x86_64-linux-gnu $ ll libstdc*
lrwxrwxrwx 1 root root      19 Apr 23 13:00 libstdc++.so.6 -> libstdc++.so.6.0.21
-rw-r--r-- 1 root root 1541600 Apr 23 13:23 libstdc++.so.6.0.21

Now, because the installed library was newer than the one on the build machine, the executable worked well.

The other solution would be linking statically, by adding the -static-libstdc++ (and -static-libgcc) options at build time.

Additional information

Which files (regular files, sockets, etc.) are opened by a process can be seen using the lsof utility.

hosung@hosung-Spectre:~$ lsof -p 6002
COMMAND    PID   USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
regdaemon 6002 hosung  cwd    DIR                8,2     4096 2589221 /home/hosung/cdot/ccl/regdaemon/Debug
regdaemon 6002 hosung  rtd    DIR                8,2     4096       2 /
regdaemon 6002 hosung  txt    REG                8,2  1066943 2545008 /home/hosung/cdot/ccl/regdaemon/Debug/regdaemon
regdaemon 6002 hosung  mem    REG                8,2    47712 2117917 /lib/x86_64-linux-gnu/libnss_files-2.19.so
regdaemon 6002 hosung  mem    REG                8,2    14664 2117927 /lib/x86_64-linux-gnu/libdl-2.19.so
regdaemon 6002 hosung  mem    REG                8,2   100728 2101352 /lib/x86_64-linux-gnu/libz.so.1.2.8
regdaemon 6002 hosung  mem    REG                8,2  1071552 2117915 /lib/x86_64-linux-gnu/libm-2.19.so
regdaemon 6002 hosung  mem    REG                8,2  3355040 6921479 /usr/lib/x86_64-linux-gnu/libmysqlclient.so.18.0.0
regdaemon 6002 hosung  mem    REG                8,2  1840928 2117938 /lib/x86_64-linux-gnu/libc-2.19.so
regdaemon 6002 hosung  mem    REG                8,2    92504 2097171 /lib/x86_64-linux-gnu/libgcc_s.so.1
regdaemon 6002 hosung  mem    REG                8,2  1011824 6846284 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20
regdaemon 6002 hosung  mem    REG                8,2  1112840 6830431 /usr/lib/libmysqlcppconn.so.7.1.1.3
regdaemon 6002 hosung  mem    REG                8,2   141574 2117939 /lib/x86_64-linux-gnu/libpthread-2.19.so
regdaemon 6002 hosung  mem    REG                8,2   149120 2117935 /lib/x86_64-linux-gnu/ld-2.19.so
regdaemon 6002 hosung    0u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    1u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    2u   CHR             136,23      0t0      26 /dev/pts/23
regdaemon 6002 hosung    3u  IPv4              63342      0t0     TCP localhost:60563->localhost:mysql (ESTABLISHED)
regdaemon 6002 hosung    4u  unix 0x0000000000000000      0t0  114861 /tmp/cc.daemon.sock

CC Image License Search Engine API Implementation

(Architecture diagram: UI, PHP API, C++ daemon, and MySQL database)

UI

Previously, my colleague Anna made a page that searches for similar images by uploading a file or providing a link. This UI page can be either inside or outside the server. It uses only the PHP API, without accessing the database directly.
 

PHP API

This is an open API with functions for adding, deleting, and matching images. It can be accessed by anyone who wants these functions; the UI page and client implementations such as browser extensions use it. The matching result is in JSON format.
This API page adds/deletes/matches by asking the C++ daemon, without changing the database itself.
Only read-only access to the database is permitted.
 

C++ Daemon

All adding/deleting operation will be done in this daemon. By doing so, we can remove the problem of synchronization between database and index for matching. That is because this daemon will have content index on the memory all the time for fast matching.
Because this daemon is active all the time, to get the request and give result to “PHP API”, it works as domain socket server. PHP API will request using domain socket.
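A minimal sketch of that socket setup (an illustration under assumptions: the /tmp/cc.daemon.sock path matches the lsof output earlier in this document, and command parsing plus the in-memory index are omitted):

// Sketch: a Unix domain socket server handling one request per connection.
#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main()
{
    int server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server < 0) return 1;

    sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/cc.daemon.sock", sizeof(addr.sun_path) - 1);

    unlink(addr.sun_path);                        // remove a stale socket file
    if (bind(server, (sockaddr*)&addr, sizeof(addr)) < 0) return 1;
    listen(server, 5);

    for (;;) {
        int client = accept(server, NULL, NULL);
        char buf[256];
        ssize_t n = read(client, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            // dispatch the add/match/delete command here ...
            write(client, "OK\n", 3);
        }
        close(client);
    }
}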
 

MySQL

The database contains all metadata about the CC-licensed images, plus the thumbnail paths used to show previews in the matching results.

Pastec Test for Performance

So far, I have tested Pastec in terms of image matching quality. In this posting, I tested the speed of adding and searching.

Adding images to index

First I added 100 images, which took 48.339 seconds. Then I added all directories from 22 to 31; those images were uploaded to wikimedia commons from 2013.12.22 to 2013.12.31.

Directory Start End Duration Count Average
22 17:32:42 18:43:50 01:11:08 8785 00:00.49
23 18:43:50 19:42:03 00:58:13 7314 00:00.48
24 19:42:03 20:28:56 00:46:53 6001 00:00.47
25 20:28:57 21:28:02 00:59:05 7783 00:00.46
26 21:28:02 22:41:12 01:13:10 9300 00:00.47
27 22:41:19 23:54:28 01:13:09 9699 00:00.45
28 23:54:28 00:53:23 00:58:55 7912 00:00.45
29 00:53:23 02:27:42 01:34:19 11839 00:00.48
30 02:27:42 03:31:48 01:04:06 8827 00:00.44
31 03:31:48 04:23:15 00:51:27 6880 00:00.45

The average time for adding an image was around 0.46 seconds, and it did not increase as the index grew. Most of the time for adding an image is spent extracting features.
I saved the index files for 100 images, for directories 22 to 26, and for directories 22 to 31. The sizes were 8.7 MB, 444.1 MB, and 935.8 MB respectively.

 

Searching images

I loaded the index file for 100 images and searched for all 100 images that had been added.

Directory Duration Count Average
22 00:01:14 100 00:00.74

Searching took 1m14.781s. Since there were 100 images, the average time to search for one image was 0.74 seconds.

Then I loaded the index file containing the index for the 39,183 images in directories 22 to 26.

Directory Start End Duration Count Average
22 09:00:05 11:21:06 02:21:01 8785 00:00.96
23 11:21:06 13:13:52 01:52:46 7314 00:00.93
24 13:13:52 14:48:26 01:34:34 6001 00:00.95
25 14:48:26 16:48:44 02:00:18 7783 00:00.93
26 16:48:44 19:13:11 02:24:27 9300 00:00.93

This time, the average time for searching one image was 0.95 seconds.

Then I loaded the index file containing the index for the 84,340 images in directories 22 to 31.

Directory Start End Duration Count Average
22 19:32:54 22:44:09 03:11:15 8785 00:01.31
23 22:44:09 01:16:59 02:32:50 7314 00:01.25
24 01:16:59 03:24:52 02:07:53 6001 00:01.28
25 03:24:52 06:11:33 02:46:41 7783 00:01.28
26 06:11:33 09:30:53 03:19:20 9300 00:01.29

Searching was performed for the same images from directories 22 to 26. The average time for searching was 1.3 seconds.

Conclusion

  • Adding an image took about 0.47 seconds.
  • Adding time did not vary with index size.
  • Searching time for an image varied with index size.
  • When the index size was 100, 39,183, and 84,340 images, searching time was 0.74, 0.95, and 1.3 seconds, respectively.
    (Chart: searching time by index size; the y-axis is time in milliseconds.)
    Around 0.6 seconds of that is likely spent reading the image and extracting features, while the search itself increases in proportion to the size of the index.

scp/sftp through an SSH tunnel

SSH Tunneling

The machine CC can be connected to from another machine called zenit.
To scp to CC through zenit, the following command establishes an SSH tunnel to CC.

ssh -L 9999:[address of CC known to zenit]:22 [user at zenit]@[address of zenit]
in my case,
ssh -L 9999:10.10.10.123:22 user1234567@zenit.server.ca

Now port 9999 on localhost (127.0.0.1) is a tunnel to CC through zenit.
This session needs to stay alive for everything that follows.

 

SCP through the SSH Tunnel

Then these commands copy the local test.png file to CC:~/tmp, and copy CC:~/tmp/test.png back to the current directory.

scp -P 9999 test.png ccuser@127.0.0.1:~/tmp
scp -P 9999 ccuser@127.0.0.1:~/tmp/test.png .

 

Making it easy

Typing those long commands every time is not a good idea, so I added an alias to .bashrc.

alias ccturnnel='ssh -L 9999:10.10.10.123:22 user1234567@zenit.server.ca'

Then I wrote two simple bash scripts.

This is cpfromcc.

#!/bin/bash
# cpfromcc: copy a file from CC to the local machine through the tunnel.
# Map the local home directory to ~ so the path resolves on the remote side.
var=$(echo $1 | sed 's/\/home\/hosung/~/g')
remote=$var
scp -P 9999 ccuser@127.0.0.1:$remote $2

This is cptocc.

#!/bin/bash
# cptocc: copy one or more local files to CC through the tunnel.
# Every argument except the last is a source file; the last is the remote destination.
values=""
remote=""
i=1
for var in "$@"
do
    if [ $i -ne $# ]
    then
        values="$values $var"
    else
        # Map the local home directory to ~ so the path resolves on the remote side.
        var=$(echo $var | sed 's/\/home\/hosung/~/g')
        remote=$var
    fi
    ((i++))
done
scp -P 9999 $values ccuser@127.0.0.1:$remote

The reason I use sed on the remote path is that bash expands ~ to my local home directory.
Now I can establish the SSH tunnel by typing ccturnnel.
Then I can scp from my machine to CC using :

cptocc test.jpg test2.jpg ~

And I can do scp from CC to my machine using :

cpfromcc ~/remotefile.txt .

 

Making it convenient using sftp

When the tunnel is established, sftp works the same way; note that OpenSSH's sftp takes the port with -P.

$ sftp -P 9999 ccuser@127.0.0.1

 

Making it more convenient using Krusader

By typing sftp://ccuser@127.0.0.1:9999 in Krusader's URL bar, and then adding the place to the bookmarks, the remote machine's file system is easily accessed.


Mounting it using sshfs should also be possible.

Pastec Test for real image data

In the previous test of Pastec, I used 900 jpeg images that were mainly computer-generated. This time, I tested images from the WikiMedia Commons archive of CC-licensed images uploaded from 2013-12-25 to 2013-12-30. They are zip files of 17 GB to 41 GB, each containing around 10,000 files, including jpg, gif, png, tiff, ogg, pdf, djvu, svg, and webm. Before testing, I deleted the xml, pdf, djvu, and webm files, leaving 55,643 images.

Indexing

Indexing the 55,643 images took around 12 hours, and the index file was 622 MB. At first, I made separate index files for each day; however, Pastec can load only one index file, so I added all 6 days' images and saved them to one index file.

While indexing, there were some errors:

  1. Pastec uses OpenCV, and OpenCV does not support gif and svg, so these two formats could not be opened.
  2. Pastec only adds images bigger than 150×150 pixels.
  3. There were zero-byte images: 153 of the 55,643 files. The images are valid on the wikimedia web pages, but the zero-byte local files caused errors.
  4. One tiff image caused a crash inside Pastec; it needs debugging.

Searching

After loading the 622 MB index file, images can be searched. Searching for all 55,643 images took around 15 hours. Every search extracts features from the query image before searching, which is why searching takes more time.

Search result

Among the 55,643 images, 751 (1.43%) were smaller than 150×150, so they were not added. 51,479 images were a proper size and format for OpenCV; they were indexed and can be searched.

  • 42,931 (83%) images matched only themselves (exactly the same image)
  • 8,459 (15%) images matched more than one image
  • 90 (0.17%) images did not match any image, not even themselves.

Images that didn't match any image

These 90 images were properly indexed but did not match even themselves.

  • 55 images were png images with transparency; the rest of the cases below are jpg images
  • 14 images were long panorama images like the following

resize_BatoFerryBambangNVjf0019_01
resize_BatoFerryBambangNVjf0019_11
resize_Biblia_Wujka_1840_V_I_p_2_s_001
resize_Biblia_Wujka_1840_V_I_p_2_s_39
resize_Biblia_Wujka_1840_Vol._I_part_2._s._84
resize_Cyanate_resonance_structures
resize_LamutRiverBridgeCordilleraIfugaojf0529_01
resize_LamutRiverBridgeCordilleraIfugaojf0529_12
resize_LamutRiverBridgeCordilleraIfugaojf0529_13
resize_Quezon,NuevaVizcayaBridgejf0120_01
resize_Quezon,NuevaVizcayaBridgejf0120_11
resize_Quezon,NuevaVizcayaBridgejf0120_12
resize_Svilaja_panorama_from_the_top

  • 6 images were simple images like the following

__Amore_2013-12-30_14-18 __Bokeh_(9775121436) __Bokeh_(9775185973) __Hmm_2013-12-30_16-54 __Moon_early_morning_Shot

  • 8 vague images: images whose lines are not clear, or photographs that are out of focus

__20131229141153!Adrien_Ricorsse SONY DSC __Llyn_Alwen,_Conwy,_Cymru_Wales_21 __Minokaya_junior_high_school __Moseskogbunn __Nella_nebbia._Franco_Nero_e_Valeria_Vaiano_in_Mineurs_-_Minatori_e_minori SONY DSC SONY DSC

  • Other cases
    __Brännblåsa_på_fingret_2013-12-26_13-40 __Pottery_-_Sonkh_-_Showcase_6-15_-_Prehistory_and_Terracotta_Gallery_-_Government_Museum_-_Mathura_201d247a1ec8535aec4f9bf86066bd10dd
    These two images are a bit out of focus.

__Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake26 __Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake41 __Jaisen_Wiki_Jalayathra_2013_Alappuzha_Vembanad_Lake42

__Logo-151-151
The original size of this image is 150×150 pixels; maybe it is too small and simple.

Images matched with more than one image

8,459 images matched more than one image. To compare the results, I generated an html file that shows all match results, like the following:
(Screenshot: the generated match-result page)

I converted all images to 250×250 pixels using the convert -resize 250x250 filename command so they could be shown on one page. The html file size was 6.8 MB, and it shows 64,630 images.

As I mentioned in my previous blog, Pastec is good at detecting rotated/cropped images.
Almost all matches were reasonable (similar). The following are notable matches :
20131225102452!Petro_canada Petro_canada_info

20131225193901!Bitlendik Bitlendik-avatar

In these two cases, the logo was matched.

20131225212947!June_Allyson_Dick_Powell_1962 June_Allyson_Dick_Powell_1962 Aveiro_342

This match looks like a false positive.

Buddhapanditsabhacandapuri Aveiro_144

This match is also a false positive.

NZ-SH45_map NZ-SH30_map NZ-SH38_map

In this case, the map is shifted.

PK-LJH_(Boeing_737-9GP)_from_Lion_Airlines_(9722507259) Blick über die Saalenbergkapelle in Sölden zum Kohlernkopf

This is an obvious false positive; maybe the sharp part of the airplane matched the part of the roof.

From my observation, obvious false positive matches that do not share any object numbered fewer than 50, which is 0.08%. Wrong matches usually occurred when the image contained graphs or documents; when the image was a normal photograph, the results were very reliable.

Pastec analysis

Pastec works in the following order :

  1. Load the visual words: the visualWordsORB.dat file contains them; its size is 32,000,000 bytes, and loading it takes around 1 second.
  2. Build the word index: using the visual words, Pastec builds the word index, which takes around 13 seconds.
  3. Now a previously saved index file can be loaded, or images can be added to the index.
  4. Given an image file, images that contain similar word indexes can be searched.
  5. The index in memory can be written to a file.

Adding a new image to the index works in the following order (step 1 is sketched after this list):

  1. ORB features are extracted using OpenCV.
  2. Matching visual words are searched.
  3. The matching visual words are indexed in memory.
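Step 1 might look like the following (a sketch assuming OpenCV 3.x; Pastec's actual parameters and OpenCV version may differ):

// Sketch: extract ORB keypoints and binary descriptors from an image.
#include <cstdio>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>

int main(int argc, char* argv[])
{
    if (argc < 2) return 1;
    cv::Mat img = cv::imread(argv[1], cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;                 // e.g. gif/svg are unsupported

    cv::Ptr<cv::ORB> orb = cv::ORB::create(); // defaults to up to 500 features
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                       // one 32-byte binary row per keypoint
    orb->detectAndCompute(img, cv::noArray(), keypoints, descriptors);

    printf("%zu keypoints\n", keypoints.size());
    return 0;
}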

When I added 900 images, the size of index file was 16,967,440 bytes.

By changing the source code, I saved the matching visual word list to a text file for each image. Each word match is stored using this struct :

struct HitForward
{
    u_int32_t i_wordId;   // visual word id
    u_int32_t i_imageId;  // image id
    u_int16_t i_angle;    // keypoint angle
    u_int16_t x;          // x coordinate
    u_int16_t y;          // y coordinate
};

Each word match has a word id, an image id, an angle, and x/y coordinates. The saved file looks like this (in the order ImageID,Angle,x,y,WordId) :

469,55772,417,111,99042
469,46096,424,453,261282
469,4246,866,265,40072
469,44288,855,295,635378
469,59150,735,268,28827
469,12526,529,112,139341
469,12513,500,39,172187
469,48546,615,59,288827

It contains 1593 lines, which means the image has 1593 matching words. Image id 469 was Jánské.jpg, a 12.8 MB HDR photograph. Like other HDR images, it contains lots of features, and it has the biggest number of matching words among the 900 images. When the data was written to the text file, the size was 39,173 bytes, which would be the worst case; when an image is simple, only a few words match. The full size of the matching word text files for all 900 images was 20.9 MB.

To reduce this, I made a simple binary format. Since the image id is the same for every word of an image, it is written once, followed by a 4-byte count. Then every word is written as a 4-byte word id, a 2-byte angle, a 2-byte x, and a 2-byte y.

4 bytes - id
4 bytes - count
4,2,2,2 (10 bytes) *  count
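A sketch of a writer for this layout (writeWords() is a hypothetical helper name of mine; HitForward is the struct shown above):

// Sketch: serialize one image's word matches into the compact binary format.
#include <cstdio>
#include <vector>
#include <sys/types.h>

struct HitForward
{
    u_int32_t i_wordId;
    u_int32_t i_imageId;
    u_int16_t i_angle;
    u_int16_t x;
    u_int16_t y;
};

void writeWords(FILE* fp, u_int32_t imageId, const std::vector<HitForward>& hits)
{
    u_int32_t count = hits.size();
    fwrite(&imageId, 4, 1, fp);          // 4 bytes - id (written once)
    fwrite(&count, 4, 1, fp);            // 4 bytes - count
    for (const HitForward& h : hits) {   // 10 bytes per matching word
        fwrite(&h.i_wordId, 4, 1, fp);
        fwrite(&h.i_angle, 2, 1, fp);
        fwrite(&h.x, 2, 1, fp);
        fwrite(&h.y, 2, 1, fp);
    }
}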

In the case of the id 469 image, the size is 15,938 bytes (8 bytes of header plus 1593 × 10 bytes of word data). The file looks like this :

00000000: d501 0000 3906 0000 e282 0100 dcd9 a101  ....9...........
00000010: 6f00 a2fc 0300 10b4 a801 c501 889c 0000  o...............
00000020: 9610 6203 0901 f2b1 0900 00ad 5703 2701  ..b.........W.'.
00000030: 9b70 0000 0ee7 df02 0c01 4d20 0200 ee30  .p........M ...0
00000040: 1102 7000 9ba0 0200 e130 f401 2700 3b68  ..p......0..'.;h
00000050: 0400 a2bd 6702 3b00 b094 0800 c64c 5f02  ....g.;......L_.

0x1d5 is 469 and 0x639 is 1593.
That is about 15 KB, around 40% of the text format (39 KB).
Since this image is the worst case, storing this binary word list in the database for every image record is realistic.
The full size for all 900 images was 8.5 MB (the text files were 20.9 MB).
Interestingly, this is smaller than the index file for the 900 images (16.2 MB).

Conclusion

I was thinking of saving the index file. However, saving the word list for each image is the better solution: in binary format it consumes less storage, and adding it to the index is very fast. Also, when it is stored as a database field, synchronization between the index and the database is no longer a problem.

How to import CMake project in Eclipse CDT4

Currently I am analysing Pastec, which uses CMake as its build system. To dig into the code, I wanted to analyse it using the functionality of Eclipse.

Pastec can be built with the following commands.

$ git clone https://github.com/Visu4link/pastec.git
$ mkdir build
$ cd build
$ cmake ../
$ make

To build Pastec in Eclipse CDT, instead of running “cmake ..”, the following needs to be done (for a Debug build).

$ cd build
$ cmake -G"Eclipse CDT4 - Unix Makefiles" -D CMAKE_BUILD_TYPE=Debug ..

Then, it can be imported into Eclipse:

  1. Import the project using the menu File->Import.
  2. Select General->Existing projects into workspace.
  3. Browse to where your build tree is and select the root build tree directory (pastec/build). Keep “Copy projects into workspace” unchecked.
  4. You get a fully functional Eclipse project.

Reference

http://pastec.io/doc#setup
http://www.cmake.org/Wiki/Eclipse_CDT4_Generator