The current development server runs on the LAMP stack. Anna is working on the Creative Commons image crawler and the user interface using PHP/MySQL. For a prototype that works with the PHP UI code and the MySQL database, I wrote an indexer and a searcher.
The database contains many records, each holding an image URL, a license, and a hash value. These records are produced by the crawler, which is written in PHP.
    $ ./mhindexer
    Usage : mhindexer hostName userName password schema table key value treeFilename
        hostName : mysql hostname
        userName : mysql username
        password : mysql password
        schema : db name
        table : table name
        key : image id field name in the table
        value : hash field name in the table
        treeFilename : mvp tree file name
    Output : treeFilename,datapointCount,elapsedSeconds
The program takes the MySQL connection information (hostname, username, password) and the database information (schema, table, key, value). After connecting with that information, it reads the ‘key’ and ‘value’ fields of every record in the ‘table’. ‘key’ is the unique key that identifies the database record holding the image information: filename, URL, hash value, etc. ‘value’ is the hash value that is used to calculate the Hamming distance.
After connecting to the database, the program reads all records that contain hash values and adds them to an MVP-tree. Once the tree is built, it is written to the ‘treeFilename’ file.
I wrote a simple bash script that runs mhindexer with the parameters. The output is:
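To make the distance calculation concrete, here is a minimal Python sketch of a Hamming distance between two hash values. It is not the indexer's actual code. The function name `normalized_hamming` is hypothetical, and dividing by the total number of bits is an assumption, suggested only by the fractional radius values (such as 0.0005) that the searcher accepts.

```python
def normalized_hamming(a: bytes, b: bytes) -> float:
    """Return the fraction of bits that differ between two equal-length hashes.

    Assumption: the distance is normalized to the range 0.0 .. 1.0 by
    dividing the number of differing bits by the total number of bits.
    """
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    if not a:
        raise ValueError("hashes must not be empty")
    # XOR leaves a 1 bit wherever the two hashes disagree; count those bits.
    differing_bits = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return differing_bits / (8 * len(a))


if __name__ == "__main__":
    # Two 2-byte hashes that differ in 4 of their 16 bits.
    print(normalized_hamming(b"\x0f\x00", b"\x00\x00"))
```

With this definition, identical hashes have distance 0.0, and a larger search radius admits hashes that differ in more bits.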
    $ ./mhindexer.sh
    tree.mh,784,0.035845
The tree built from the hashes in the database was written to tree.mh. It contains 784 data points, and the run took 0.035845 seconds.
    Usage : mhsearcher treeFilename imageFilename radius
        eg : mhsearcher tree.mh ./test.jpg 0.0005
    output : 0-success, 1-failed
        success : 0,count,id,id,id,...
            eg : 0,2,101,9801
        failed : 1,error string
            eg : 1,MVP Error
For now, the searcher reads the tree file (treeFilename) to rebuild the tree structure, extracts the MH hash from the input file (imageFilename), and then searches the tree for that hash value within ‘radius’.
The output is consumed by the PHP script. When the first comma-separated field is 0, there was no error and the result is meaningful. The second field is the number of matching hashes, and the following fields are the ids of those hashes. Using the ids, the PHP script can fetch the image information from the database.
When the first field is 1, the following field is the error message.
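The output format described above can be parsed along these lines. This is a Python sketch for illustration only, since the real consumer is the PHP script. The names `SearchResult` and `parse_searcher_output` are hypothetical, and keeping any fields that follow the ids in a separate `extras` list is my own choice.

```python
from typing import List, NamedTuple


class SearchResult(NamedTuple):
    ids: List[str]      # ids of the matching hashes
    extras: List[str]   # any fields that follow the ids, kept unparsed


def parse_searcher_output(line: str) -> SearchResult:
    """Parse one line of mhsearcher output: 0,count,id,... or 1,error string."""
    fields = line.strip().split(",")
    status = fields[0]
    if status == "1":
        # Rejoin the rest in case the error message itself contains commas.
        raise RuntimeError(",".join(fields[1:]) or "unknown error")
    if status != "0" or len(fields) < 2:
        raise ValueError(f"unexpected searcher output: {line!r}")
    count = int(fields[1])
    ids = fields[2:2 + count]
    if len(ids) != count:
        raise ValueError(f"expected {count} ids, got {len(ids)}")
    return SearchResult(ids=ids, extras=fields[2 + count:])


if __name__ == "__main__":
    # The success example from the usage text.
    print(parse_searcher_output("0,2,101,9801"))
```

Because the id list is sliced by the count in the second field, any extra fields appended after the ids do not disturb the result.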
To test it, I randomly chose an image that is in the database.
Example output is :
    $ ./mhsearcher tree.mh WTW_Nov_2013_Tumanako_023.JPG 0.001
    0,0,0.001,8,0.000029
    $ ./mhsearcher tree.mh WTW_Nov_2013_Tumanako_023.JPG 0.1
    0,0,0.1,778,0.000648
    $ ./mhsearcher tree.mh WTW_Nov_2013_Tumanako_023.JPG 0.2
    0,1,60,0.2,784,0.000657
    $ ./mhsearcher tree.mh WTW_Nov_2013_Tumanako_023.JPG 0.3
    0,1,60,0.3,784,0.000658
    $ ./mhsearcher tree.mh WTW_Nov_2013_Tumanako_023.JPG 0.44
    0,5,539,60,380,188,371,0.44,784,0.000672
For performance statistics, I appended the radius, the calculation count, and the extraction time to the end of the result.
For this image, the matching image was found when the radius was 0.2, and there were 5 results when the radius was 0.44.
- These utilities work well with MySQL and PHP.
- Because of the characteristics of the tree search algorithm, the searcher could repeat the search internally, increasing the radius from 0.001 to 0.5, to get a fast and reliable result.
- Later, the indexer and searcher can be turned into a Linux daemon process that keeps the tree in memory for fast searching.
- When the number of database records is enormous (millions to billions), the tree can be divided into several sections in the database.
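The repeated search with a growing radius mentioned in the list above could look roughly like this. It is a Python sketch under stated assumptions, not part of the current searcher. `expanding_search` is a hypothetical helper, `fake_search` is a stand-in for the real tree query, and the ids and distances in `distances` are made up for the demonstration.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def expanding_search(
    search: Callable[[float], List[int]],
    radii: Iterable[float],
) -> Tuple[Optional[float], List[int]]:
    """Try each radius in order and stop at the first one that yields matches."""
    for radius in radii:
        matches = search(radius)
        if matches:
            return radius, matches
    return None, []  # nothing found even at the largest radius


# Stand-in for the tree query: made-up ids mapped to made-up distances
# from the query image. The real searcher would query the MVP-tree instead.
distances = {1: 0.15, 2: 0.41, 3: 0.43}


def fake_search(radius: float) -> List[int]:
    return sorted(i for i, d in distances.items() if d <= radius)


if __name__ == "__main__":
    radius, ids = expanding_search(fake_search, [0.001, 0.1, 0.2, 0.3, 0.44, 0.5])
    print(radius, ids)
```

Starting with a small radius keeps the common case cheap, because a tight search visits few data points, and the wider searches only run when nothing close enough was found.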