Full Text Search Webinar Questions Followup

I presented a webinar this week to give an overview of several Full Text Search solutions and compare their performance. Even if you missed the webinar, you can register for it, and you’ll be emailed a link to the recording.

During my webinar, a number of attendees asked some good questions. Here are their questions and my answers.

Adrian B. commented:

Q: Would’ve been a good idea to retrieve the same number of rows on each benchmark (I noticed 100 rows on SQL and 20 on Sphinx). Also Sphinx does the relevance sorting by default, adding relevance sorting to the MySQL queries would make them even slower, I’m sure.

Indeed, the result set of 20 rows from SphinxQL queries is merely the default. And sorting by relevance must have some effect on the speed of the query. I didn’t control for these variations, but that would be worthwhile to do.

Geoffrey L. asked:

Q: Do you have any data on memory usage for these tests?

No, I didn’t measure the memory usage for each test. I did allocate buffers in what I hope was a realistic way. For instance, when testing the InnoDB FT index, I made sure to increase innodb_buffer_pool_size to 50% of my RAM, and then when I tested the MyISAM FT index, I reallocated that memory to key_buffer_size.

The other thing that happened with regard to memory was that when I tried to create an InnoDB FT index before declaring the `FTS_DOC_ID` primary key column properly, MySQL crashed with a fatal out-of-memory error, and the Windows 7 server I was testing on prompted me to shut down the operating system. So be careful about declaring the correct primary key!

Mario asked:

Q: Did you compare relevancy of search results between each engine? If yes, which one seems the best?

No, I didn’t compare relevancy. I was focused solely on query response time for this test. You’re right that the specific results probably vary based on the search implementation, and that’s an important factor in choosing a solution.

David S. asked:

Q: If searching on multiple terms, can you get Sphinx to report which matched?

The SHOW META command in Sphinx shows how many rows each keyword matched, and then a total for the search expression. But I don’t know a way to report exactly which rows matched, without doing additional searches for each keyword.
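The "additional searches for each keyword" approach could look roughly like the Python sketch below. It is only an illustration: `search` is a hypothetical stand-in for whatever client call runs a single-keyword query, and here it reads from a tiny in-memory dictionary rather than a real Sphinx server.

```python
from collections import defaultdict

# Hypothetical stand-in for an index: keyword -> set of matching document IDs.
# In a real setup, search() would send one single-keyword query to the server.
FAKE_INDEX = {
    "mysql": {1, 2, 5},
    "sphinx": {2, 3},
    "solr": {3, 5},
}


def search(keyword):
    """Return the set of document IDs matching one keyword."""
    return FAKE_INDEX.get(keyword.lower(), set())


def matched_terms(keywords):
    """Map each document ID to the sorted list of keywords it matched."""
    hits = defaultdict(list)
    for kw in keywords:
        for doc_id in search(kw):
            hits[doc_id].append(kw)
    return {doc_id: sorted(kws) for doc_id, kws in hits.items()}


report = matched_terms(["mysql", "sphinx"])
```

The cost is one extra query per keyword, which is why this is a workaround rather than a built-in report.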

Jessy B. asked:

Q: With respect to Solr and Sphinx, do indexes stay up to date with changes (inserts, updates and deletes)?

That’s a good question, because both of these external indexing technologies depend on being able to reindex as data in the source database changes. This is an important consideration for choosing a full-text solution, because updating the index can become quite complex.

Solr makes it easier to add documents, either individually as you add new data to the MySQL database, or in periodic batches of data that has changed since the last time you updated the index. You can use the DataImportHandler to import the result of any SQL query, and if you can form a SELECT query that returns the “new” data (for example, WHERE updated_at > '2012-08-22 00:00:00', the time of the prior update), you can do this anytime.
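To make the “changed since the last update” query concrete, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, and the `posts` table, its columns, and the `changed_since` helper are all hypothetical names chosen for illustration. It shows only the SELECT side, not how the rows are sent to Solr.

```python
import sqlite3


def changed_since(conn, last_indexed_at):
    """Return rows modified after the previous index update."""
    cur = conn.execute(
        "SELECT id, title FROM posts WHERE updated_at > ? ORDER BY id",
        (last_indexed_at,),
    )
    return cur.fetchall()


# SQLite stands in for MySQL here; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        (1, "old post", "2012-08-20 09:00:00"),
        (2, "edited post", "2012-08-22 10:30:00"),
        (3, "new post", "2012-08-23 08:15:00"),
    ],
)

# Only rows touched after the prior update would need to be re-sent to the index.
delta_rows = changed_since(conn, "2012-08-22 00:00:00")
```

A query like this picks up inserts and updates, but rows deleted from the source table never appear in its result, so deletes need separate handling.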

Sphinx Search is a bit harder, because it’s quite costly to update an index incrementally — it’s basically as expensive as creating the whole index. For that reason, there are a couple of strategies used by Sphinx Search users to support changing data. One is to store two indexes, one for historical data, and the other for the “delta” of recently-changed data. Your application would have to be coded to search both indexes to find all matches among the most current data. You would merge the delta index with the main index periodically.
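The application-side part of the main-plus-delta strategy can be sketched in Python as follows. This is an assumption-laden illustration, not Sphinx client code: `search_main_and_delta`, `make_searcher`, and the two dictionaries are hypothetical stand-ins for queries against the historical and delta indexes.

```python
def search_main_and_delta(query, search_main, search_delta):
    """Combine results from both indexes, letting delta rows override main rows."""
    merged = dict(search_main(query))   # doc_id -> document from the historical index
    merged.update(search_delta(query))  # fresher versions replace stale ones
    return merged


# Hypothetical in-memory stand-ins for the two indexes.
MAIN = {1: "full text search in mysql", 2: "sphinx search basics"}
DELTA = {2: "sphinx search basics, revised", 3: "solr search and indexing"}


def make_searcher(docs):
    """Build a toy search function that does a plain substring match."""
    def run(query):
        return {doc_id: text for doc_id, text in docs.items() if query in text}
    return run


results = search_main_and_delta("search", make_searcher(MAIN), make_searcher(DELTA))
```

Document 2 exists in both indexes, and the delta version wins. One limitation of this sketch: a document edited so that it no longer matches the query would still surface from the main index, so a real deployment needs some extra bookkeeping for changed and deleted rows.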

Sphinx Search also offers a supplementary in-memory RT (real-time) index type that supports direct row-by-row updates. But your application code still has to update the RT index as data changes. Since RT indexes are in volatile memory, not stored on disk, you are responsible for integrating new data with the on-disk Sphinx index periodically. There doesn’t currently seem to be a function to merge an RT index with an on-disk index, so that integration may require reindexing the whole dataset from time to time.

Mike W. asked:

Q: If the Sphinx index isn’t in sync, will the out-of-sync rows not be found?

That’s right, they won’t be found. Only the documents included in the Sphinx Search indexes will be returned by Sphinx Search queries.

Mike W. also asked:

Q: What about MemSQL and indexes. Have you benchmarked it?

According to their documentation, MemSQL supports hash indexes and skip list indexes, but not full-text indexes, so comparisons would not be meaningful.

Since MemSQL is an in-memory database, you can get a lot of speed improvement because you’re searching data without touching the disk, but I assume a text search would necessarily do table scans.

Jessy B. also asked:

Q: Were these tests performed on a single machine and a common/share set of disk?

The test machine I used is a Windows 7 tower with an Intel i3 processor, 8GB of RAM, and two SSD drives: one for the Windows partition, and one for the MySQL data partition. I performed all the tests on this machine. I realize this isn’t representative of modern production systems, but hopefully by performing all the tests on the same hardware, I got results that are at least comparable to each other.

Hernan S. commented on the blog post where I announced the webinar:

We evaluated MySQL vs Solr. I was able to index all the data from the database into Solr and make it queryable from a browser within four days plus some customization on the search algorithm. It would have taken me two to three weeks to do something equivalent with MySQL and it wouldn’t be as flexible and customizable as Solr. With Solr, I was able to fine tune search and I still feel there are tons of additional features that will help me address future needs.

Great points, Hernan. Each application project is different, and has different requirements for the FT functionality. So one solution may include advanced features that are must-haves for your application, but are not so important for another application. I tried to test only the functionality all these solutions had in common, so I tested only simple queries without customizing the indexing.

Jeffrey S. asked:

Q: Do you know if there’s a good reference that discusses what technologies might be most adequate for various application types? (I can only think of my company’s video library search). Talking about the relevance of search results, sorry if it wasn’t clear.

No, I don’t know of a reference that compares these technologies for different application types. There are books that describe how to use one technology or the other, and some may compare one technology to the other, but typically these comparisons are made with respect to individual features, not in the context of application types. You’d have to evaluate how relevant the search results are for your application needs; this isn’t something a benchmark can tell you.

Thanks for all the questions!

I’d like to see some of the folks who viewed my Full-Text Search Throwdown webinar when I present the popular Percona Training in Salt Lake City, September 24-27. See http://www.percona.com/training/ for details on our training offerings and the schedule for upcoming events.

The post Full Text Search Webinar Questions Followup appeared first on MySQL Performance Blog.
