Apache Solr Vs Elasticsearch: Making the right choice

Noc Folio3

4 years ago

So, you are looking for the best search engine to meet your requirements? In that situation, many questions come to mind about which technology stack fits the search engine you want.

If we follow recent trends, two search engines stand out for building real-time, scalable, high-performance applications: Solr and Elasticsearch. Both are open source and built on the Lucene text search library. The two platforms share more or less the same set of features, yet they differ in areas such as scalability, indexing, community presence and data-fetching techniques. Over the last decade both have grown rapidly, and they remain the most common choices for adding search to an application.

People keep coming back to the same questions: which search engine is better and faster, which is more flexible and easier to manage, and which scales better and offers more features. These questions have no clear, universal answers, and choosing between the two is difficult because each has its own salient features. Here we look at the factors that distinguish them in terms of strengths and weaknesses. As a case in point, suppose you aim to build an application that involves big data and real-time search. Such an application has to fetch large amounts of content while coping with a complex architecture, big data, real-time processing and data insights, so these factors matter when making the right choice.

Over the last decade, Elasticsearch has become the open-source search engine of choice for many teams and is often regarded as the stronger of the two. That said, whether you opt for Solr or Elasticsearch depends entirely on your business requirements.

Inception, Growth & Maturity

Solr was created in 2004 by Yonik Seeley at CNET Networks, well before Elasticsearch. In 2006 it was donated to the Apache Software Foundation, where it gained features and attracted an open-source community. It then matured into a stable product with frequent releases. Although still under active development, it steadily grew in popularity and dominated the industry until Elasticsearch was officially introduced in 2010.

Elasticsearch was created in 2010 by Shay Banon as the successor to his earlier project, Compass. Initially it came nowhere close to Solr in stability or feature set. However, Elasticsearch was designed on modern principles to serve modern use cases and to deliver near-real-time (NRT) search straight from Lucene. Solr had not yet achieved NRT at that point, even though both use Lucene as their underlying search library. Because Elasticsearch was new, with less visibility than Solr and little pressure from backward compatibility or market share, it was free to add capabilities for big-data search, scaling and cloud environments on top of its existing features. Eventually Elasticsearch exceeded expectations and worked wonders for its community.

Community & Open Source

Both Solr and Elasticsearch are open-source search engines released under the Apache license. Solr has a broad open-source community to which anyone can contribute, and Solr committers are chosen on merit alone. Elasticsearch, on the other hand, is only partially open in practice: the source code is available on GitHub and contributors can submit change requests, but the final call is made solely by employees of the Elasticsearch company. In that sense Elasticsearch is driven by a single company rather than by a community.

Each of these open-source governance models comes with its own pros and cons. Solr committers come from different environments and communities and bring their own ideas and contributions, which gives Solr a broad feature set. On the downside, Solr's code management is less consistent and less optimized, although anyone can still contribute to it. Elasticsearch may not accept contributions from everyone, but its extra layer of development-cycle and quality checks results in higher consistency and quality.

Installation & Configuration

In Solr you define fields, field types and other attributes in the managed-schema file (called schema.xml in older documentation). You can customize the built-in tokenizers and analyzers or create new ones for your field types according to your own guidelines. You can also declare all fields as dynamic and index documents on the fly, which is known as a schemaless core. However, you will need some level of structure as the system grows more complex and the content and data become more diverse. Copy fields let you index the same value under different field types, giving you separate indexing for different operations, such as searching on one field and sorting on another.
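As a rough illustration of the field and copy-field setup described above, the sketch below builds the kind of JSON command body that Solr's Schema API accepts. The field names (title, title_sort), the URL in the comment and the helper function are assumptions made for this example; text_general and string are field types found in Solr's default configset, but your schema may differ.

```python
import json


def build_schema_commands():
    """Build a Schema API command body: two fields plus a copy field.

    All field names here are hypothetical examples.
    """
    return {
        "add-field": [
            # Tokenized field used for full-text search.
            {"name": "title", "type": "text_general", "indexed": True, "stored": True},
            # Untokenized copy of the same value, used for sorting.
            {"name": "title_sort", "type": "string", "indexed": True, "stored": False},
        ],
        # Copy every incoming title value into the sortable field.
        "add-copy-field": {"source": "title", "dest": ["title_sort"]},
    }


commands = build_schema_commands()
# This body would be POSTed to something like
# http://localhost:8983/solr/<collection>/schema (the URL is an assumption).
print(json.dumps(commands, indent=2))
```

The block only constructs the payload, so it can be inspected and adjusted before anything is sent to a real Solr instance.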

Configuring Elasticsearch takes only a few steps and is done in elasticsearch.yml. Not every setting lives in this file, though: some, such as the placement of shards and replicas, can be changed on a live cluster without restarting Elasticsearch nodes. In contrast, any change to a Solr configuration file requires the core to be reloaded or the Solr node to be restarted. Elasticsearch can index a document without a predefined schema and will infer the field types accordingly. Alternatively, you can create an index mapping for each type of document so that documents are mapped explicitly as they are indexed.
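To make the difference between an explicit mapping and a live settings change more concrete, here is a minimal sketch of the two request bodies involved. The index name (products) and the field names are hypothetical, and the typeless "mappings" layout assumes a recent Elasticsearch version; older releases nest the properties under a document type name.

```python
import json

# Explicit mapping for a hypothetical "products" index.
create_index_body = {
    "settings": {"number_of_shards": 2, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "price": {"type": "float"},
            "created_at": {"type": "date"},
        }
    },
}

# The replica count is a dynamic setting, so it can be changed
# on a live index without restarting any node.
update_settings_body = {"index": {"number_of_replicas": 2}}

# PUT /products            -> create_index_body
# PUT /products/_settings  -> update_settings_body
print(json.dumps(create_index_body, indent=2))
print(json.dumps(update_settings_body))
```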

APIs, Client APIs and Schema APIs

Elasticsearch and Solr both expose a broad layer of HTTP APIs, not only for indexing and querying documents but also for management: performing CRUD operations on the schema, creating collections, managing Solr cores, creating and managing indices, and retrieving metrics about the current state and configuration. Client libraries are available for all major languages, including .NET, Python, Java, PHP, JavaScript, Ruby and Perl.

To use the Solr API you address a particular request handler, which defines the scope of the request, and pass query string parameters that vary from request to request. For example, to fetch search results from Solr you send an HTTP GET request to the select request handler with query parameters that define the query, sort order, child document transformation, pagination and response type. The Solr API is not limited to querying and indexing data; it also provides a powerful mechanism for operating on Solr structures such as cores, collections and the schema.
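The request described above can be sketched as a small URL builder. The host, core name (products) and field names are assumptions for illustration; q, sort, start, rows and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, sort, page, page_size, fmt="json"):
    """Build a GET URL for Solr's select request handler."""
    params = {
        "q": query,                 # the search query
        "sort": sort,               # sort order, e.g. "price asc"
        "start": page * page_size,  # pagination offset
        "rows": page_size,          # page size
        "wt": fmt,                  # response writer (response format)
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr", "products", "title:laptop", "price asc",
    page=2, page_size=10,
)
print(url)
```

Changing fmt to "xml" or "csv" would request a different response format without touching the rest of the query.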

One handy feature is that the Solr response is customizable. An API request can ask for a specific response format: not only JSON but also XML, CSV, XLSX and XSLT-transformed output. Solr also provides a powerful dashboard that visualizes the exposed HTTP APIs, while Elasticsearch offers the Marvel and Kibana apps for that purpose.

If you are familiar with Elasticsearch, you can simply consume its API using GET, POST, PUT and DELETE. Elasticsearch communicates only in JSON: instead of sending data in query parameters, you express the request as a JSON structure and receive the response in the same format.
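For comparison with Solr's query string parameters, the following sketch expresses a similar search (query, sort and pagination) as an Elasticsearch JSON body. The index name in the comment and the field names are hypothetical; match, sort, from and size belong to the Elasticsearch query DSL.

```python
import json


def build_search_body(field, text, sort_field, page, page_size):
    """Build the JSON body for an Elasticsearch _search request."""
    return {
        "query": {"match": {field: text}},         # full-text match on one field
        "sort": [{sort_field: {"order": "asc"}}],  # ascending sort
        "from": page * page_size,                  # pagination offset
        "size": page_size,                         # page size
    }


body = build_search_body("title", "laptop", "price", page=2, page_size=10)
payload = json.dumps(body)
# Would be sent as: POST /products/_search with Content-Type: application/json
print(payload)
```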

Indexing & Searching

Both platforms provide plenty of options for indexing. Solr accepts data not only as JSON and XML but also from Word and PDF files. Elasticsearch can also pull data from external sources such as Git, Kafka, MongoDB, Redis and AWS SQS. In both you are free to go schemaless, partially structured or fully structured, and you can create your own field types using analyzers and tokenizers. In Elasticsearch you can additionally index with a user-specified mapping. Because both are built on top of Lucene, both support full-text search, but Elasticsearch additionally lets you run many other kinds of queries, including aggregations, geospatial queries, and queries over metrics and time-series data. Both support parent-child relationships in their own way: in Solr you can define child transformers to filter child records for particular parents and vice versa, and in Elasticsearch you can achieve a similar result with the has_child and top_children queries.
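A hedged sketch of the two parent-child approaches follows. The field names (doc_type, rating) and the child type (review) are invented for this example, and the exact syntax varies between versions: the Solr part uses the [child] document transformer in its classic parentFilter form, and the Elasticsearch part uses the has_child query.

```python
import json
from urllib.parse import urlencode

# Solr: fetch parent documents and attach their matching children
# via the [child] document transformer in the field list (fl).
solr_params = {
    "q": "doc_type:parent",
    "fl": "*,[child parentFilter=doc_type:parent childFilter=rating:5]",
}
solr_query_string = urlencode(solr_params)

# Elasticsearch: return parents that have at least one matching child.
es_body = {
    "query": {
        "has_child": {
            "type": "review",                  # hypothetical child type
            "query": {"term": {"rating": 5}},  # condition on the child
        }
    }
}

print(solr_query_string)
print(json.dumps(es_body))
```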

Distributed search makes it possible to search large volumes of data in near real time (NRT). Suppose we have a cluster with multiple nodes and an index called Demo that holds ten million records in total, split across two shards. If we search for the keyword ‘level’, the query runs on both shards in parallel and the result reaches the user in roughly half the time. The concept resembles multi-threading in programming or a distributed database in the database world.
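The scatter-gather idea can be illustrated with a toy simulation. This is not how Solr or Elasticsearch are implemented; the shards here are plain in-memory lists and the function names are made up, so the sketch only shows the pattern of querying every shard in parallel and merging the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, keyword):
    """Return the ids of documents in one shard whose text contains the keyword."""
    return [doc_id for doc_id, text in shard if keyword in text.lower()]


def distributed_search(shards, keyword):
    """Query every shard in parallel and merge the partial hit lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_hits = pool.map(lambda shard: search_shard(shard, keyword), shards)
        merged = [doc_id for hits in partial_hits for doc_id in hits]
    return sorted(merged)


# Two tiny shards standing in for the two halves of the Demo index.
shard_a = [(1, "Entry level laptop"), (2, "Gaming mouse")]
shard_b = [(3, "Spirit level tool"), (4, "Office chair")]

hits = distributed_search([shard_a, shard_b], "level")
print(hits)  # [1, 3]
```

In a real cluster each shard lives on its own node, so the per-shard work runs on separate machines rather than on threads in one process.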

Conclusion

There are several takeaways from the discussion above:

If your application needs to work with analytical and metrics data, then Elasticsearch would be your first choice.

If you need distributed search or need to scale your application extensively, then Elasticsearch is the better option, as clusters, shards and nodes are key building blocks of a distributed, cloud environment.

If you need a fully open-source platform, then Solr would be the better choice, as it is run entirely by the community.

If your business requirement is a standard search application with powerful, dependable search capability but no need for massive scalability, then Solr would be your first choice.

Since both giants are excellent open-source choices with large and extensive feature sets, your choice ultimately depends on the budget, timing, scenario and complexity of the project.