The 37 gigabyte GML file

As GeoServer matures, one of the main focuses becomes the ability to scale: dealing with both large amounts of data and large numbers of users. I got a bit of time to play with GeoServer a week or so ago, and wanted to test out the large data side of things. On our GeoServer demo site, we’ve got about 19 gigabytes of data that we’re serving up. It’s all available through the WMS with OpenLayers on the front end, but the data is never exposed all at once. One of GeoServer’s strengths is the WFS, which provides access to the raw vector data, so I wanted to try to download the whole dataset.

GeoServer has some great fundamental design: it’s built in such a way that data is never really held in memory. Instead it streams from the database into GeoTools Java objects and then out to the appropriate output format. So in theory we should be able to stream GML from databases of any size. So that’s what I did, and there were absolutely no memory errors or other problems. Due to the verbose nature of GML, the file GeoServer produced was about 37 gigabytes, containing road, landmark, and water data for the whole US, and country boundaries and place names for the world. The data came down at 4.97 MB/s, which I don’t think is too bad for transforming it to GML on the fly. One interesting thing I noticed is that with larger datasets there’s a noticeable pause before the data starts returning. With small datasets the streaming nature of GeoServer tends to produce results right away. So we’ll have to do some investigation to see what’s taking that time. Hopefully it’s something we have control over and not buried in PostGIS or some such. I do believe that GeoServer can handle databases of any size, and would love to hear reports from people out there working with even bigger sets of data.
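For anyone who wants to try a similar download, here is a minimal client-side sketch in Python. It is not the exact command I ran. It builds a WFS GetFeature URL from the standard key-value parameters and copies the response to disk one chunk at a time, so the client never holds the whole file in memory either. The base URL, layer name, and output path are placeholders, and the function names and chunk size are assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CHUNK_SIZE = 64 * 1024  # bytes read per iteration; an arbitrary choice for this sketch


def build_getfeature_url(base_url, type_name, version="1.0.0"):
    """Build a WFS GetFeature URL from the standard key-value parameters."""
    params = {
        "service": "WFS",
        "version": version,
        "request": "GetFeature",
        "typeName": type_name,  # layer name; supply your own
    }
    return base_url + "?" + urlencode(params)


def stream_copy(source, dest, chunk_size=CHUNK_SIZE):
    """Copy source to dest one chunk at a time and return the bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the response is finished
            break
        dest.write(chunk)
        total += len(chunk)
    return total


def download_gml(url, path):
    """Stream a GetFeature response straight to disk (makes a network request)."""
    with urlopen(url) as response, open(path, "wb") as out:
        return stream_copy(response, out)


def estimated_hours(size_gb, rate_mb_per_s):
    """Rough transfer time in hours, taking 1 GB as 1024 MB."""
    return size_gb * 1024 / rate_mb_per_s / 3600


if __name__ == "__main__":
    # Hypothetical endpoint and layer name, shown only to illustrate the call.
    url = build_getfeature_url("http://localhost:8080/geoserver/wfs", "roads")
    print(url)
    print("37 GB at 4.97 MB/s takes about %.1f hours" % estimated_hours(37, 4.97))
    # download_gml(url, "roads.gml")  # uncomment to run against a real server
```

Because only one chunk is in memory at a time, the size of the file on the client side is limited by disk space rather than RAM, which mirrors the way the server streams the data out. At the rate I saw, a file of this size works out to roughly two hours of transfer.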

In the next couple weeks we’re also going to have Justin set up a better testing suite for scalability, using JMeter. A number of people have done testing against GeoServer with it before, but in a more ad hoc way. This will build the tests into the source distribution, so that we are sure it gets run with every release, so we don’t have any regressions. There have been many speed improvements of late, and many more to come, so we want to be sure that other changes in the code don’t accidentally slow things down. One more tidbit of scalability news: Gabriel reported that he attended a meeting where someone presented some GeoServer benchmarks, successfully supporting 1000 requests.

9 Comments

  1. Posted March 14, 2007 at 11:11 am | Permalink

    And yes, I’m well aware that this is not a ‘real’ test in any way. It was more just for fun. I’d love to see better comparisons with other servers, and if you do, please get in touch with the GeoServer team as we’d be excited to work with you. If we can attach a profiler to your testing environment we can identify where GeoServer is spending lots of time and improve it to respond even more quickly.

  2. shonbh
    Posted March 26, 2007 at 6:16 am | Permalink

    hello all,

    I am going to try GeoServer with a big dataset (over 2 million parcels). Do you think it will run like a horse?

  3. Posted March 26, 2007 at 11:45 am | Permalink

    GeoServer should do pretty well with 2 million parcels. The dataset that I was using was well over 2 million rows. If possible we recommend PostGIS, since it’s what we’ve tested the most with. In 1.5.0-RC3 we just added some functionality to have it work even better with huge datasets.

  4. shonbh
    Posted April 1, 2007 at 10:41 pm | Permalink

    Thanks for your advice!
    Is there any guide for using GeoServer in building a National Spatial Data Infrastructure?
    Does GeoServer have enough features for SDI?

  5. Posted April 2, 2007 at 10:18 am | Permalink

    Well, an SDI involves many pieces of software, and indeed much, much more than just software. But we do believe that GeoServer can play a very vital role in building SDI’s, as a default way for everyone to share their information. However, things like searching for information, having compelling clients to display it, and indeed the non-technical licensing and sharing agreements that make up an SDI are out of the scope of GeoServer. We choose to focus on a few things and do them very, very well.

  6. shonbh
    Posted April 2, 2007 at 8:26 pm | Permalink

    “But we do believe that GeoServer can play a very vital role in building SDI’s, as a default way for everyone to share their information.”
    Many thanks! That’s all I meant.
    I will consider your advice.

  7. Bryan Hall
    Posted April 10, 2007 at 10:00 pm | Permalink

    A little late to the party but… we just loaded the latest RC against Oracle Spatial (full EE). We have some rather large datasets, and some layers with rather verbose items that we will be hitting for the next week in test. I’ll be sure to drop a note if something seems off, and will watch for any slow startups with large datasets.

    Overall – well done (beats the pants off UMN Mapserver).

  8. John Z
    Posted April 18, 2007 at 8:32 pm | Permalink

    “In the next couple weeks we’re also going to have Justin set up a better testing suite for scalability, using JMeter… This will build the tests in to the source distribution, so that we are sure it gets run with every release, so we don’t have any regressions.”

    This sounds pretty cool. Couple of questions: Will you leave things in the releases that will facilitate end-user benchmarking? That might allow the project to accept consistent benchmark results from the community on a variety of platforms. Will you document the benchmarking process? Any chance the benchmark process could be done in a generic fashion to allow identical testing of other servers (Free and otherwise)?

  9. Justin Deoliveira
    Posted April 19, 2007 at 1:19 pm | Permalink

    Hi John,

    The process will be documented soon, most likely before 1.5.1 goes out. And yes, you will be able to use the process to benchmark against other WMS servers. The way it works is that a script is run against a WMS server, and that generates the JMeter test suite specific to that server. We also plan to make the configuration (MapServer + GeoServer) and the data used in the benchmark available. So stay tuned 🙂
