Avri Blog

Blog à l'avricot - Lord Of Castle, java, nosql, Utomia, AvriChat, javascript, Json, Css, Mootools, ajax, php...


Tuesday, January 17 2012

Choosing the best solution for storing billions of documents (PDF): NoSQL and others

What is the best solution to store billions of binary documents?

Requirements :

  • We want to store billions of binary documents. This storage has to be safe (replicated) and durable.
  • We need backups and a DRP (mostly in case of human error).
  • Versioning would be nice to have, but it's not a requirement.
  • The storage will probably be in the cloud.
  • Eventual consistency is OK.


Here is the list of solutions I studied, with pros & cons.

Cassandra, column-oriented :

pros : great for scaling, failover and replication. Great support, books and community. Great for writes.
cons : Setup might be complicated. Based on Thrift, which offers no streaming abilities. See http://stackoverflow.com/questions/3911529/storing-binary-data-on-cassandra-just-like-mysql-blob-binary

Hbase, column-oriented :

pros : based on HDFS, designed to scale and replicate, good documentation and books. Can be used with Hadoop tools (Pig, Hive etc.). Great for reads.
cons : Setup might be complicated. Isn't really good for binary data, see http://reavely.blogspot.com/2011/05/hbase-scalability-for-binary-data-i.html, http://www.quora.com/Apache-Hadoop/Is-HBase-appropriate-for-indexed-blob-storage-in-HDFS, http://www.quora.com/Apache-Hadoop/How-would-HBase-compare-to-Facebooks-Haystack-for-photo-storage.
A solution might be to store the link/id in HBase and the data in HDFS, but the NameNode keeps an image of the entire file system namespace and the file block map in memory, so be careful if you really have tons of documents.
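
To get a feel for that NameNode constraint, here is a back-of-the-envelope estimate. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block). That figure is an approximation I did not measure, and the function name is made up for the example.

```python
# Rough NameNode heap estimate for many small files stored directly in HDFS.
# ASSUMPTION: ~150 bytes of heap per namespace object (file or block),
# a commonly quoted rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(file_count, blocks_per_file=1):
    """Estimate the heap used by `file_count` files of `blocks_per_file` blocks each."""
    objects = file_count * (1 + blocks_per_file)  # one file entry + its blocks
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for count in (10**6, 10**8, 10**9):
        gib = namenode_heap_bytes(count) / float(2**30)
        print("%d files -> about %.1f GiB of NameNode heap" % (count, gib))
```

Under that assumption, memory grows linearly with the number of documents, which is why one small file per document gets uncomfortable at the billion scale.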

A couple of interesting links. Cassandra vs HBase : http://www.roadtofailure.com/2009/10/29/hbase-vs-cassandra-nosql-battle/
Write performance : http://www.quora.com/How-does-HBase-write-performance-differ-from-write-performance-in-Cassandra-with-consistency-level-ALL

Voldemort project, key/value, based on Amazon's Dynamo design :

pros : key/value is great for us, built-in versioning. Great replication, failover and sharding.
cons : All code processes one value at a time in memory. No cursors or streaming.

Redis : no cluster, doesn't fit our need with its all-in-memory system.

Riak (key/document): not free, sorry :/

Classic file system :

pros : very simple and fast
cons : can't scale (what if your data is bigger than a single disk?), doesn't provide replication, failover and backups

Tokyo Cabinet : I can't find any decent documentation and I don't know this solution, so I can't really say anything. Any input is welcome :)

CouchDB (key/document, MVCC, uses REST as the interface to the DB; this might be useful to serve the documents directly, but we must handle permissions in our case)

pros : high durability (all data and associated indexes are flushed to disk and the transactional commit always leaves the database in a completely consistent state; this might also be a con, because it is slower than MongoDB's in-memory approach). Has great books, but the online documentation is a little scattered. Great replication, failover and sharding. Has a streaming feature.
cons : CouchDB compaction will have to recopy the binary documents. The REST interface could be a con (I just want to stream my data to my client's OutputStream).

MongoDB with GridFS (key/document, update in place, has nice geospatial features)

pros : performance (memory), great durability via journaling, widely used, great books and documentation (!)
Specific support for binary files with GridFS (native versioning, file streaming).
Great for replication/sharding with replicas, failover.
Built-in backup with a delayed replica.
cons : ... Some info about GridFS : http://www.coffeepowered.net/2010/02/17/serving-files-out-of-gridfs/, http://stackoverflow.com/questions/3413115/is-gridfs-fast-and-reliable-enough-for-production
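
The reason GridFS can stream is that a file is stored as a metadata entry plus a series of fixed-size chunks. Here is a minimal in-memory sketch of that idea. It is not the MongoDB driver API: the `store` dict, the function names and the chunk size are all made up for the illustration.

```python
import hashlib
import io


def put_file(store, file_id, stream, chunk_size=256 * 1024):
    """Read `stream` chunk by chunk and store each chunk under (file_id, n)."""
    md5 = hashlib.md5()
    length = 0
    n = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        store["chunks"][(file_id, n)] = data
        md5.update(data)
        length += len(data)
        n += 1
    # One small metadata entry per file, separate from the chunks.
    store["files"][file_id] = {"length": length, "chunks": n, "md5": md5.hexdigest()}


def iter_file(store, file_id):
    """Yield the chunks in order, so the caller never holds the whole file."""
    for n in range(store["files"][file_id]["chunks"]):
        yield store["chunks"][(file_id, n)]


if __name__ == "__main__":
    store = {"files": {}, "chunks": {}}  # hypothetical stand-in for two collections
    payload = b"%PDF-1.4 " + b"x" * 1000
    put_file(store, "doc-1", io.BytesIO(payload), chunk_size=256)
    out = io.BytesIO()
    for chunk in iter_file(store, "doc-1"):
        out.write(chunk)  # in a web app this would be the response stream
    print(store["files"]["doc-1"]["chunks"], out.getvalue() == payload)
```

Both the write and the read side only ever keep one chunk in memory, which is exactly what Cassandra's Thrift interface and Voldemort could not offer in the comparison above.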



The winner seems to be MongoDB, although CouchDB might be a good solution too. MongoDB's delayed replication is a really great feature when you need DB backups !
Since I have no experience with CouchDB, I think I'll start my POC with GridFS and see whether it really is a good choice.

Friday, January 6 2012

gmap asynchronous lib with a nice message/image displayed when no address can be found

Here is a little script I wrote to use Google Maps with the library loaded asynchronously.
It can also display a nice error if the address can't be found on Google Maps, which you can customize with your own CSS.
It uses some jQuery selectors.
Include this script in your page :

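KD = {
	utils: {
		// Dynamic script loader: appends a <script> tag to the page body.
		dsl: function (src) {
			var script = document.createElement("script");
			script.type = "text/javascript";
			script.src = src;
			document.body.appendChild(script);
		},
		// Merge the user options into the defaults. init() calls this helper but it
		// was missing from the snippet; this version is an assumption based on
		// jQuery's deep extend.
		initOptions: function (options, defaults) {
			if (options) {
				$.extend(true, defaults, options);
			}
		}
	}
};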
KD.gMaps = {
	utils:{
		apiLoaded: false,
		toLoad: [],
		/**
		 * Once the api is loaded, we execute all the methods we save using the waitForApi method of each map instance.
		 */
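		callback: function () {
			// Refer to KD.gMaps.utils explicitly instead of "this", in case the loader
			// invokes the callback without a receiver.
			var utils = KD.gMaps.utils;
			utils.apiLoaded = true;
			// Empty the queue before replaying it, so that no call runs twice.
			var pending = utils.toLoad;
			utils.toLoad = [];
			for (var i = 0; i < pending.length; i++) {
				var l = pending[i];
				//Call the saved function of the map instance.
				l.map[l.fun].apply(l.map, l.args);
			}
		}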
	},
	Map: function () {
		if(!KD.gMaps.utils.apiLoaded) {
			KD.utils.dsl("http://maps.googleapis.com/maps/api/js?sensor=false&callback=KD.gMaps.utils.callback");
		}
	}
};
 
KD.gMaps.Map.prototype = {
	geocoder: null,
	map: null,
	divMap: null,
	divMapContainer: null,
	/**
	 * Default options.
	 */
	options: {
		addressNotFound: "", //"Address not found"
		/* If the size isn't defined in the options, the divMap size will be used. Otherwise the size will be overridden by the options.
		size :{ 
			width: "300",
			height: "500"
		},*/
		//All other google maps options can be used here.
		map: {
			zoom: 14, 
			mapTypeId: "roadmap" //google.maps.MapTypeId.ROADMAP
		}
	},
	/**
	 * Init the google map div. Can be called with an extra "options" parameter if needed.
	 */
	init: function (divMap, options) {
		if(!this.isWaitingForApi("init", arguments)) {
			KD.utils.initOptions(options, this.options);
			this.geocoder = new google.maps.Geocoder();
			if(typeof(divMap) !== "undefined") {
				this.divMapContainer = divMap;
			} else if(this.divMapContainer == null) {
				alert("you should init at least once with a divMap");
			}
			//Create the inner div that holds the map (appendTo returns the new div, whereas append returns the container).
			this.divMap = $("<div></div>").appendTo(this.divMapContainer.empty());
			//Set the map size: use options.size if defined, otherwise the container size.
			var that = this;
			$.each(['height', 'width'], function (i, type) {
				var value;
				if(typeof(that.options.size) == "undefined" || typeof(that.options.size[type]) == "undefined") {
					value = that.divMapContainer.css(type);
				} else {
					value = that.options.size[type];
					that.divMapContainer.css(type, value);
				}
				that.divMap.css(type, value);
			});
			this.map = new google.maps.Map(this.divMap.get(0), this.options.map);
		}
	},
	/**
	 * Go to the given address.
	 */
	goToAddress: function(address) {
		if(!this.isWaitingForApi("goToAddress", arguments)){
			if(this.divMap == null) {
				this.init();
			}
			if(address == "") {
				this.setAddressNotFound();
			} else {
				var that = this ;
				this.geocoder.geocode( { 'address': address}, function(results, status) {
					if (status == google.maps.GeocoderStatus.OK) {
						that.goToLatLng(results[0].geometry.location);
					} else {
						that.setAddressNotFound();
					}
				});
			}
		}
	},
	/**
	 * Move the map to the given lat/long.
	 */
	goToCoo: function (lat, long) {
		if(!this.isWaitingForApi("goToCoo", arguments)){
			this.goToLatLng(new google.maps.LatLng(lat, long));
		}
	},
	/**
	 * Save the call to the method. If the api isn't loaded yet, the method will be called when the api is loaded.
	 * Return true if the api isn't loaded
	 */
	isWaitingForApi: function (functionName, args) {
		if(!KD.gMaps.utils.apiLoaded) {
			KD.gMaps.utils.toLoad.push({"fun": functionName, "map": this, "args": args}); 
			return true ;
		}
		return false;
	},
	/**
	 * Set the addressNotFound message in the map.
	 */
	setAddressNotFound: function () {
		this.divMap = null;
		this.map = null;
		this.divMapContainer.removeAttr("style");
		this.divMapContainer.html("<p>"+this.options.addressNotFound+"</p>");
	},
	/**
	 * Move the map to the given google's google.maps.LatLng object.
	 */
	goToLatLng: function(myLatlng) {
		this.map.setCenter(myLatlng);
		var marker = new google.maps.Marker({
			map: this.map,
			position: myLatlng
		});
	}
};

Once it's included, use it this way to display the address you want ("#map" is a placeholder for your own map container) :

var companyMap = new KD.gMaps.Map();
companyMap.init($("#map"), {addressNotFound: "Can't find this address"});
companyMap.goToAddress("8 RUE DE L HOPITAL ST LOUIS, 75010 PARIS, FRANCE");

See also goToLatLng and goToCoo. You can call these functions once the map has been set up with the init method.

Wednesday, December 21 2011

A primitive folder backup

Here is a simple way to back up a folder.
This shell script makes a tar archive of a folder and removes the backups older than 10 days. Simple but useful.

shell
#!/bin/sh
NOW=$(date +"%Y-%m-%d")
SUFFIX="backup-nuxeo"
tar -czf "/srv/data/backup/$SUFFIX-$NOW.tar.gz"  /srv/data/app/nuxeo-dm-5.4.2-tomcat/nxserver/data/binaries

#Remove old backup files
find /srv/data/backup/ -ctime +10  -name "$SUFFIX*.tar.gz" -exec rm {} \;

cron :

shell
01 02 * * * /srv/data/backup/backup-nuxeo.sh
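
If you'd rather do the same from Python (for example on a machine without tar/find), the script above can be sketched roughly as follows. This is only an approximation of the shell version: it prunes on the modification time instead of `-ctime`, and the function name and parameters are mine.

```python
import os
import tarfile
import time


def backup_folder(source_dir, backup_dir, prefix="backup-nuxeo", keep_days=10, now=None):
    """Archive source_dir into backup_dir, then delete archives older than keep_days."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%d", time.localtime(now))
    archive = os.path.join(backup_dir, "%s-%s.tar.gz" % (prefix, stamp))
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir.rstrip(os.sep)))

    # Remove old backup files (same intent as the `find ... -exec rm` line).
    cutoff = now - keep_days * 24 * 3600
    removed = []
    for name in os.listdir(backup_dir):
        path = os.path.join(backup_dir, name)
        if name.startswith(prefix) and name.endswith(".tar.gz") and path != archive:
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(name)
    return archive, removed


if __name__ == "__main__":
    # Paths taken from the shell script above.
    print(backup_folder(
        "/srv/data/app/nuxeo-dm-5.4.2-tomcat/nxserver/data/binaries",
        "/srv/data/backup"))
```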

Sunday, December 18 2011

A small google insight api using python

I recently had to use Google Insight for the Google data viz challenge.
Google doesn't provide an API for Insight, and you may need to run a request with more than 5 terms.
You will find the connection script, and an example of how to use it, in the http://www.tendance2012.fr project : https://github.com/Avricot/prediction/tree/master/python-insight

Inspired by http://pypi.python.org/pypi/pyGTrends/0.8

find unused properties, image and css styles in a project

Sometimes you need to remove old properties, CSS styles and images which are not used anymore.
Here are 3 Python scripts you might need for this. Each one prints the items (properties, images or styles) suspected to be unused :
unusedImages.py
unusedProperties.py
unusedCss.py
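
The scripts themselves are not reproduced here, so here is a minimal sketch of the general idea for the properties case. It is not the code of unusedProperties.py: the function names, the list of extensions and the "key appears as plain text" heuristic are assumptions of mine. Keys built dynamically (by concatenation, for instance) will be reported wrongly, hence "suspected".

```python
import os
import re

# A key is whatever precedes the first "=" or ":" on a non-comment line.
KEY_LINE = re.compile(r"^\s*([^#!\s][^=:\s]*)\s*[=:]")


def property_keys(properties_path):
    """Return the keys declared in a Java-style .properties file."""
    keys = []
    with open(properties_path) as handle:
        for line in handle:
            match = KEY_LINE.match(line)
            if match:
                keys.append(match.group(1))
    return keys


def unused_properties(properties_path, source_dir,
                      extensions=(".java", ".jsp", ".js", ".xml")):
    """Return the keys that never appear as plain text in any source file."""
    remaining = set(property_keys(properties_path))
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            if not name.endswith(extensions):
                continue
            with open(os.path.join(root, name), errors="ignore") as handle:
                text = handle.read()
            # Keep only the keys still not seen anywhere.
            remaining = {key for key in remaining if key not in text}
    return sorted(remaining)


if __name__ == "__main__":
    import sys
    # Usage: python unused_sketch.py messages.properties src/
    for key in unused_properties(sys.argv[1], sys.argv[2]):
        print(key)
```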

Wednesday, August 24 2011

Selenium select doesn't fire onChange event with IE

Selenium's select method doesn't fire an onchange event on IE.
Here is a simple hack : we focus the select (located with the "locator" parameter), change its value, and finally remove the focus. Firing a blur event doesn't work, so we simply give the focus to the body.

public void select(String locator, String optionLocator) {
   if (isIE) {
     //Add the focus to the select
      selenium.focus(locator);
   }
   selenium.select(locator, optionLocator);
   if (isIE) {
      // Remove the focus to fire the onchange()
      selenium.focus("css=body");
   }
}

munin sending alert email configuration

To send email alerts with munin, you have to configure Exim4, Debian's mail server.
The documentation is here : http://pkg-exim4.alioth.debian.org/README/README.Debian.html
In particular, there is a summary on configuring Exim with Gmail's SMTP server : http://wiki.debian.org/GmailAndExim4

You can run the command

dpkg-reconfigure exim4-config 

and follow the instructions, or edit the configuration files :

vim update-exim4.conf.conf
# /etc/exim4/update-exim4.conf.conf
#...
dc_eximconfig_configtype='smarthost'
dc_other_hostnames='monHost.com'
dc_local_interfaces='127.0.0.1'
dc_readhost=''
dc_relay_domains=''
dc_minimaldns='false'
dc_relay_nets=''
dc_smarthost='smtp.gmail.com::587'
CFILEMODE='644'
dc_use_split_config='false'
dc_hide_mailname='false'
dc_mailname_in_oh='true'
dc_localdelivery='mail_spool'
vim /etc/mailname
monHost.com

In either case, you have to add the login/password of the Google SMTP server :

vim /etc/exim4/passwd.client
# password file used when the local exim is authenticating to a remote
# host as a client.

*.google.com:monMail@monHost.com:monPassword

*.google.com

And finally give exim the right permissions :

chown root:Debian-exim /etc/exim4/passwd.client

Plugin configuration

Munin's limits are then configured in munin's configuration file (vim /etc/munin/munin.conf). For example, to change the df alert levels on monHost :

[monHost]
        address monHost
        use_node_name no
        df._dev_md2.warning 30 #_dev_md2 = /home
        df._dev_md2.critical 40

The syntax is pluginName.label.(critical|warning) value
The label value is shown on the plugin's web page under "Internal name".

Some plugins can be configured through the file

vim /etc/munin/plugin-conf.d/munin-node

for example, for the http_response_full plugin

[http_response_*]
env.domain http://www.monHost.com
env.url1 /

To see the default configuration of a plugin, you can run

munin-run df config
graph_title Disk usage (in percent)
graph_args --upper-limit 100 -l 0
graph_vlabel %
graph_scale no
graph_category disk
_dev_md1.label /
_dev_md1.warning 92
_dev_md1.critical 98
_dev_md2.label /home
_dev_md2.warning 92
_dev_md2.critical 98

Munin documentation on sending email :
http://munin-monitoring.org/wiki/HowToContact

Monday, August 8 2011

hibernate delete all then re-insert - adding a @OrderColumn and updating db with python

Because a List can contain duplicate elements, Hibernate cannot tell its rows apart, so updating a List results in a full delete followed by a re-insert. A complete explanation can be found in the article Why Hibernate does "delete all then re-insert" - its not so strange. A good solution is to use a Set, or to add an index column to your List :

@JoinTable(name = "JOINTURE", joinColumns = @JoinColumn(name = "KEY_A"), inverseJoinColumns = @JoinColumn(name = "KEY_B"))
@ManyToMany(fetch = FetchType.LAZY)
@OrderColumn(name = "ORDERS_INDEX")
private List<MyObject> myList ;

Once the modification is done, you may need to initialize the existing entries. Here is a quick python script

# To install MySQLdb : sudo apt-get install python-mysqldb
import sys
import MySQLdb

db = MySQLdb.connect(passwd="dev", db="mybb", host="localhost", user="dev")
#db.autocommit(1)

db.query("""SELECT KEY_A, KEY_B from JOINTURE ORDER BY KEY_A""")
r = db.store_result()

id_A = None  # KEY_A of the group being numbered
i = 0        # position of the current row inside its KEY_A group
while 1:
        row = r.fetch_row(how=1)
        if not row:
                break
        row = row[0]
        i = i + 1
        if id_A != row['KEY_A']:
                # New KEY_A group: restart the index at 0
                i = 0
                id_A = int(row['KEY_A'])
        try:
                c = db.cursor()
                c.execute("UPDATE JOINTURE set ORDERS_INDEX=%s where KEY_A=%s and KEY_B=%s", (i, id_A, int(row['KEY_B'])))
        except Exception:
                print(sys.exc_info()[1])
db.commit()
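
The numbering logic of that script (restart the index at 0 each time KEY_A changes) can be checked without a database. Here is a small sketch on in-memory rows using itertools.groupby. The sample rows are invented, and like the SQL query it assumes the rows are already sorted by KEY_A.

```python
from itertools import groupby
from operator import itemgetter


def order_indexes(rows):
    """Yield (key_a, key_b, index), with index restarting at 0 for each key_a.

    `rows` must already be sorted by key_a, like the ORDER BY KEY_A query.
    """
    for key_a, group in groupby(rows, key=itemgetter(0)):
        for index, (_, key_b) in enumerate(group):
            yield key_a, key_b, index


if __name__ == "__main__":
    rows = [(1, 10), (1, 11), (2, 10), (2, 12), (2, 13)]  # made-up (KEY_A, KEY_B) pairs
    for key_a, key_b, index in order_indexes(rows):
        print("UPDATE JOINTURE SET ORDERS_INDEX=%d WHERE KEY_A=%d AND KEY_B=%d"
              % (index, key_a, key_b))
```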

Sunday, July 10 2011

hibernate : pagination with a collection and the criteria api

Let's say you have an object TheList, which holds a list of Company objects (companies), and you want to paginate the companies of a specific list using the Criteria API. A Company is composed of an id and a List of urls (String).

The first solution that comes to mind is the following :

final Criteria rootCriteria = session.createCriteria(TheList.class, "t");
	    final Criteria companiesCriteria = rootCriteria.createCriteria("t.companies");
	    companiesCriteria.setFirstResult(firstResult).setMaxResults(maxResult);
	    rootCriteria.add(Restrictions.eq("t.id", listId)).add(Restrictions.eq("t.owner.id", ownerId));
	    return rootCriteria.list();

This will return a list of at most "maxResult" TheList objects.
Unfortunately, you can't get just the list of companies : Hibernate's Criteria API can't handle the following query :

final Criteria rootCriteria = session.createCriteria(TheList.class, "t");
	    final Criteria companiesCriteria = rootCriteria.createCriteria("t.companies");
	    companiesCriteria.setFirstResult(firstResult).setMaxResults(maxResult);
	    rootCriteria.add(Restrictions.eq("t.id", listId)).add(Restrictions.eq("t.owner.id", ownerId));
	    rootCriteria.setProjection(Projections.property("t.companies"));
	    return rootCriteria.list();

You will get a nice error :

java.lang.ArrayIndexOutOfBoundsException: 0
	at org.hibernate.loader.criteria.CriteriaLoader.getResultColumnOrRow(CriteriaLoader.java:148)
...

So we can stick to the first solution, and add a built-in transformer:

...
rootCriteria.setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY);

But this doesn't fit our need, since the limit is applied to the TheList entity and not to the companies. Useless...

The only solution I found was to create a subquery that selects the company ids, and a main query that selects the companies based on the ids returned by the subquery.
It looks like this :

final Criteria rootCriteria = session.createCriteria(Company.class, "c").setMaxResults(maxResult).setFirstResult(firstResult) ;
	    final DetachedCriteria subQuery = DetachedCriteria.forClass(TheList.class, "t") //
		    .add(Restrictions.eq("t.id", listId)).add(Restrictions.eq("t.owner.id", ownerId)) //
		    .createAlias("t.companies", "cp") //
		    .setProjection(Projections.property("cp.id"));
	    final List<Company> companies = rootCriteria.add(Subqueries.propertyIn("c.id", subQuery)).list(); //"exists" may be a better solution than "in".

The main concern with this solution is that you can't call setMaxResults/setFirstResult on a DetachedCriteria, and a Subqueries restriction needs a DetachedCriteria. That's a pity : the subquery will return all the company ids, and we only need a few of them...
We could also run 2 separate queries : get the ids with the first one, and the companies with the second one.

Now that you have a list of companies, you will probably need to fetch the list of urls contained in each company, because you want to display them on your web page.

final Criteria rootCriteria = session.createCriteria(Company.class, "c").setMaxResults(maxResult).setFirstResult(firstResult) //
		    .setFetchMode("c.urls", FetchMode.JOIN) ;
	    final DetachedCriteria subQuery = DetachedCriteria.forClass(TheList.class, "t") //
		    .add(Restrictions.eq("t.id", listId)).add(Restrictions.eq("t.owner.id", ownerId)) //
		    .createAlias("t.companies", "cp") //
		    .setProjection(Projections.property("cp.id"));
	    final List<Company> companies = rootCriteria.add(Subqueries.propertyIn("c.id", subQuery)).list(); //"exists" may be a better solution than "in".

Damn ! This doesn't work either ! It is not implemented yet for Hibernate Criteria, only for HQL (feel free to send a patch...) : https://hibernate.onjira.com//browse/HHH-869

org.hibernate.MappingException: collection was not an association: com.model.company.Company.urls
...


So the only thing we can do is to run an extra query to fetch all the urls, and map the urls to the companies.
Adding urls to a company attached to the session would result in an extra update, so I decided to evict the companies from my session (I'm in read-only mode).
Enjoy !

final List<Long> ids = new ArrayList<Long>();
	    for (final Company c : companies) {
		ids.add(c.getId());
		session.evict(c);
	    }
	    // "distinct" avoids getting the same Company once per joined url.
	    final List<Company> cs = session.createQuery("select distinct c from Company c left join fetch c.urls where c.id in (:companiesId)").setParameterList("companiesId", ids).list();

OK, that's just too painful.
Let's first get the ids, and then use HQL to do the fetch join I need. I'll use a StringBuilder to dynamically build the extra criteria of my query...
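
To make the "ids first, then fetch" idea concrete outside of Hibernate, here is a sketch of the same two-step pagination in plain SQL, run through Python's sqlite3 module. The table and column names (list_company, company_url) are invented for the example and are not the real schema.

```python
import sqlite3


def page_of_companies(conn, list_id, first_result, max_results):
    """Return [(company_id, [urls])] for one page of a list's companies."""
    # Query 1: only the ids, so LIMIT/OFFSET counts companies and not joined rows.
    ids = [row[0] for row in conn.execute(
        "SELECT company_id FROM list_company WHERE list_id = ? "
        "ORDER BY company_id LIMIT ? OFFSET ?",
        (list_id, max_results, first_result))]
    if not ids:
        return []
    # Query 2: fetch the urls of that page in one go.
    marks = ",".join("?" * len(ids))
    urls = {company_id: [] for company_id in ids}
    for company_id, url in conn.execute(
            "SELECT company_id, url FROM company_url "
            "WHERE company_id IN (%s) ORDER BY company_id, url" % marks, ids):
        urls[company_id].append(url)
    return [(company_id, urls[company_id]) for company_id in ids]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE list_company (list_id INTEGER, company_id INTEGER);
        CREATE TABLE company_url (company_id INTEGER, url TEXT);
        INSERT INTO list_company VALUES (1, 1), (1, 2), (1, 3);
        INSERT INTO company_url VALUES (1, 'http://a.example'), (2, 'http://b.example');
    """)
    print(page_of_companies(conn, 1, 0, 2))
```

The limit is applied to the id query only, so a company with many urls still counts as one row of the page.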

Conclusion : the Criteria API has some features missing...

Thursday, June 30 2011

hibernate @Any mapping : just can't fetch the "any" object, org.hibernate.type.AnyType cannot be cast to org.hibernate.type.ComponentType

Using Hibernate's @Any mapping to map multiple entity types to the same property, I haven't been able to build a query that fetches all the objects of a specific class.
Hibernate doesn't seem to be able to handle this kind of query.
My application doesn't start and throws the following error :

Invocation of init method failed; nested exception is java.lang.ClassCastException: org.hibernate.type.AnyType cannot be cast to org.hibernate.type.ComponentType

Removing the "left join fetch ic.item i" part solves this error.

@Entity
@NamedQueries({
@NamedQuery(name = "GET_ITEMS", query = "from ItemContainer ic left join fetch ic.item i where ic.id=:containerId and ic.class=:clazz")})
public class ItemContainer {
.... 
    @Any(metaColumn = @Column(name = "TYPE"), fetch = FetchType.LAZY)
    @AnyMetaDef(idType = "long", metaType = "string", metaValues = { @MetaValue(targetEntity = Item1.class, value = "I1") })
    @JoinColumn(name = "TYPE_ID")
    private IItem item;
...
}

(Item1 implements IItem)

@Override
@SuppressWarnings("unchecked")
public List<ItemContainer> getItem1s(final Long containerId) {
return getHibernateTemplate().findByNamedQueryAndNamedParam("GET_ITEMS", new String[] { "containerId", "clazz" },
		new Object[] { containerId, Item1.class });
}

:/

Any ideas ?
