tsearch2 for Rails applications
Having finally managed to squirrel out some free time to work on Cookmarking. I really wanted to add a bit of functionality that we were lacking in the initial implementation phase, one of the things we reckoned was really important was the ability to search bookmarks.
The way you implement a search in a Rails application can vary, from really simple ilike / like implementations such as those given by Searchlogic to something a little more sophisticated such as Thinking Sphinx or Acts As Ferret.
There are other contenders to search which are largely ignored by the Rails community – of which I reckon one of the more stronger ones is tsearch2. The advantage that tsearch2 has over Searchlogic is that much like Thinking Sphinx and Acts As Ferret is that it is a full text search, the main advantage being performance. There are other points to consider such as the inclusion of stop words (e.g. to ignore the word “the” in searches) as well as being able to rank results and only having to maintain a single index (the amalgamation of all your fields for search).
So, why tsearch2 over Thinking Sphinx and Acts As Ferret? the main thing being that it is built into Postgres and does not require an external daemon or any other gems to make it work. It will work right out of the box with Postgres 8.3 and without having to install pl/pgsql in Postgres 9.0 and is blazingly fast when returning a small-medium resultset (150k records).
Initial preparation
Preparation is simple – if you are working on a new application, its as simple as having a column of type tsvector in the table for which you want to search as well as a gin index, however, if you are adding to an existing application – you will need to do the following:
Create tsvectors for existing records
The best way to create tsvectors for existing content is to write a function:
-- needed for existing and new content
alter table bookmarks add column tsv tsvector;
create index bookmarks_tsv_idx on bookmarks using gin(tsv);
-- needed only for existing content
create or replace function refresh_bookmarks_tsv()
returns void as $$
begin
update bookmarks
set tsv = q.weighted_tsv
from
(select
bookmarks.id,
setweight(to_tsvector(coalesce(bookmarks.title, '')), 'A')
|| setweight(to_tsvector(coalesce(bookmarks.description, '')), 'C')
|| setweight(to_tsvector(coalesce(string_agg(tags.name, ' '), '')), 'B')
as weighted_tsv
from bookmarks
left join user_bookmarks on user_bookmarks.bookmark_id = bookmarks.id
left join taggings on (taggings.taggable_type = 'UserBookmark'
and taggings.taggable_id = user_bookmarks.bookmark_id)
left join tags on taggings.tag_id = tags.id
group by bookmarks.id, bookmarks.title, bookmarks.description) q
where q.id = bookmarks.id;
end
$$ language plpgsql;
select refresh_bookmarks_tsv();
Note here, I have used acts_as_taggable_on for my tags. Lets have a rundown of what exactly this does
string_agg
string_agg is used to turn an aggregated set of results into a string, so for example if the tags for a bookmark were chicken, garlic, bbq – you would get a result from string_agg of “chicken garlic bbq”, which is perfect for the vector conversion
setweight
Setweight is used here for ranking later on, so that in this the title is the most important thing, followed by, the tags then followed by the description
to_tsvector
Very simple, turns a string into a tsvector
Write a function for new records to add a tsv
create or replace function bookmarks_update_tsv(integer) returns void as $$
begin
update bookmarks
set tsv = q.weighted_tsv
from
(select
bookmarks.id,
setweight(to_tsvector(coalesce(bookmarks.title, '')), 'A')
|| setweight(to_tsvector(coalesce(bookmarks.description, '')), 'C')
|| setweight(to_tsvector(coalesce(string_agg(tags.name, ' '), '')), 'B')
as weighted_tsv
from bookmarks
left join user_bookmarks on user_bookmarks.bookmark_id = bookmarks.id
left join taggings on (taggings.taggable_type = 'UserBookmark'
and taggings.taggable_id = user_bookmarks.bookmark_id)
left join tags on taggings.tag_id = tags.id
group by bookmarks.id, bookmarks.title, bookmarks.description) q
where q.id = bookmarks.id and bookmarks.id = $1;
end
$$ language plpgsql;
This is very similar to the function above, with the difference that it takes a parameter – that being an id and then has a where clause of that id. This function will be useful for triggers
Triggers on tables where insertion / updates / deletion matters
Firstly, we need to look at the shape of the application function and decide when we want the function applied, so most likely it will need to run on insertion of a bookmark and also on the update of a bookmark
create or replace function bookmarks_tsv_trigger() returns trigger as $$
begin
if (tg_op = 'INSERT') then
perform bookmarks_update_tsv(new.id);
elsif (tg_op = 'UPDATE') then
if row(new) is distinct from row(old) then
perform bookmarks_update_tsv(new.id);
end if;
end if;
return null;
end;
$$ language plpgsql;
create trigger bookmarks_tsv_tr after insert or update on bookmarks
for each row execute procedure bookmarks_tsv_trigger();
Finally, we want the function also to run on either the insertion / update or deletion of a tagging
create or replace function taggings_tsv_trigger() returns trigger as $$
declare
rec record;
begin
if (tg_op = 'INSERT') then
if (new.taggable_type = 'UserBookmark') then
select * into rec from user_bookmarks
where user_bookmarks.id = new.taggable_id;
perform bookmarks_update_tsv(rec.bookmark_id);
end if;
elsif (tg_op = 'UPDATE') then
if (new.taggable_type = 'UserBookmark') then
select * into rec from user_bookmarks
where user_bookmarks.id = new.taggable_id;
perform bookmarks_update_tsv(rec.bookmark_id);
end if;
elsif (tg_op = 'DELETE') then
if (old.taggable_type = 'UserBookmark') then
select * into rec from user_bookmarks
where user_bookmarks.id = old.taggable_id;
perform bookmarks_update_tsv(rec.bookmark_id);
end if;
end if;
return null;
end;
$$ language plpgsql;
create trigger taggings_tsv_tr after insert or update or delete on taggings
for each row execute procedure taggings_tsv_trigger();
Migration
All of this needs to go inside a migration – for all migrations I use an execute – since I am more comfortable writing SQL rather than the ActiveRecord migration DSL. The up looks like:
def self.up
execute <<EOF
SQL HERE
EOF
end
I also tend not to write downs, since I normally use a production dump for development, besides that – migrations are run inside a transaction – so, if anything fails, it just needs to be re-run.
Usage
In order to use tsearch2, its as simple as any other query:
scope :search, lambda { |search|
if search.blank?
select("bookmarks.*,
coalesce(avg(user_bookmarks.rating)::int, 0) as average_rating,
coalesce(count(user_bookmarks.review), 0) as number_of_reviews")
else
query = sanitize_sql_array ["plainto_tsquery(?)", search]
select("bookmarks.*,
coalesce(avg(user_bookmarks.rating)::int, 0) as average_rating,
coalesce(count(user_bookmarks.review), 0) as number_of_reviews,
ts_rank_cd(tsv, #{query}, 32) as rank").where("tsv @@ #{query}")
end.joins("left join user_bookmarks on user_bookmarks.bookmark_id = bookmarks.id").
group(columns.collect {|s| "bookmarks.#{s.name}"}.join(", "))
}
And there you have it – simple, no dependancy search powered by SQL.
“The way you implement a search in a Rails application can vary, from really simple ilike / like implementations such as those given by Searchlogic to something a little more sophisticated such as Thinking Sphinx or Acts As Ferret”