Introducing Trinity: a modern, high performance, elegant IR(search) library
Trinity is a modern C++ information-retrieval (IR) library, for parsing and building queries, indexing documents and other content, running queries, and processing matching documents.
With Trinity, you can build a search engine with no practical scalability bounds. Thanks to its architecture, it’s easy to implement near-real-time indexing, hybrid index storage and partitioning (on-disk, in-memory).
You can mix, match, extend and customise the various Trinity core classes to accomplish whatever you need. It should be trivial to implement most designs, and feasible to implement more advanced systems. If you ‘d like to build something and you are not sure how to go about it though (granted, the documentation is currently very basic), please ask or file an issue.
Trinity is named after Trinity in the Matrix movies and represents the trinity of IR (query, index, search), and the core functionality implemented by the library.
The fundamental traits of this library are simplicity, extensibility, composability, and high performance. The various classes and APIs are carefully designed to that end(in fact, it took more time to come up with them, that than to actually implement the library), although given that there is currently no documentation one would need to browse the codebase to understand the design and architecture. The codebase footprint is very small though(less than 14k LOC) and finding your way around everything shouldn’t be too hard.
Trinity queries can be extremely complex, made up of boolean operators (AND, NOT, OR), parenthesis, tokens, phrases, and other, Trinity specific, operators. Queries are essentially ASTs, which makes it easy to build queries, extend or re-write them, and inspect them. In fact, there is a handy Trinity::rewrite_query() function for efficient and optimal query rewrites you should probably use.
Trinity uses and supports codecs that facilitate access to the index. This means you can build your own, or extend the two included codecs based on Google index encoding and Lucene encoding designs. There are too many possibilities, you should study the codebase for more on that and other subjects.
The execution engine that runs queries is tuned for very high performance. Queries are translated to ‘instructions’, and 4-pass compiler is responsible for that and for optimizing output by combing instructions, breaking them down into microinstructions, grouping, etc in order to come up with the fastest execution plan, requiring as few instructions as possible. In our various test datasets(millions of documents), it took sub-1ms to execute very elaborate queries (the longest it took was 200ms for a very complicated query made up of 100s of terms).
I apologize for the lack of benchmarks, but, like it is the case with documentation, they will become available as soon as we have some time to get to them. In the meantime, we ‘d love it if would help with that.
Trinity is been developed by Phaistos Networks, S.A, and it is open source, licensed under the Apache 2 License. Code contributions and issues/requests reports are welcome. Trinity is available on Github.