AND Search using SQL Server Full Text Search

DISCLAIMER: I’m new to SQL Server Full Text Search, so if what I’m saying here is wrong then let me know.

Sticking with the Knowledge Base project, we had to implement search. We decided to use SQL Server Full Text Search for this. Once the fulltext catalogs were setup, the query my colleague wrote looked like this:

SELECT [Key], [Rank]
FROM FREETEXTTABLE(dbo.Question, QuestionText, 'building consent form', 100)

The problem with this search is that it will return results which contain any one of the keywords. So if a user enters “building consent form”, a page only needs to contain building OR consent OR form and it will be included in the results. I call this an OR search. This reminds me of the early days of search engines, (AltaVista.com et al), which used to work the same way.

AltaVista c.1999. Remember?

Back in the AltaVista days, if you wanted only pages with every search term you had to use AND between each search term e.g. [building AND consent AND form]. Nowadays you don’t need to type the ANDs; if you enter [building consent form] into Google, it will only return results which contain building, consent, and form. This is what I call an AND search.

SQL Server Full Text Search seems to be stuck in the late 1990s days of OR searches. As demonstrated above it will return a hit for any one of the search words. And there doesn’t seem to be an easy way to tell it to do an AND search.

AND Search

To do an AND search with SQL Server Full Text Search, you need to query the CONTAINSTABLE.

SELECT	[Key], [Rank], 1
FROM	CONTAINSTABLE(dbo.Question, QuestionText, '"building" AND "consent" AND "form"', 100)

Because we’re querying the CONTAINSTABLE and not the FREETEXTTABLE we lose all Full Text Search’s nice fuzziness around the search terms, i.e. in the above example, Questions which contain “buildings”, “consent”, and “form” won’t be returned. To help we can do the following:

SELECT	[Key], [Rank], 1
FROM	CONTAINSTABLE(dbo.Question, QuestionText, '"building*" AND "consent*" AND "form*"', 100)

Stop right there

Another problem with this is stopwords, aka noise words (SQL Server 2005 calls them noise words, 2008 stopwords). SQL Server has a list of around 150 common words, which are ignored by the Full Text Search engine. e.g. about, after, all, also, an, and, another etc. You can view the English stopwords in SQL Server 2008 like so:

select * from sys.fulltext_system_stopwords where language_id = 1033

In my table there is a Question titled “What are building consents?”. If we search for “what are building consents” like so:

SELECT	[Key], [Rank], 1
FROM	CONTAINSTABLE(dbo.Question, QuestionText, '"what*" AND "are*" AND "building*" AND "consents*"', 100)

because “are” is a stopword (and so is “what”) we won’t get any results!
There is a workaround – to wipe the list of stop words! The problem with doing this is that the list of stop words is setup server-wide on the SQL Server itself. In our production environment other applications are using the same SQL Server instance (but of course we have our own KnowledgeBase database), so this wasn’t an option.

Then the customer added a new requirement – if a Question contains the exact search phrase then that should be returned first. Adding the code for that use case fixes my search.

CREATE PROCEDURE dbo.Search
(
@keywords NVARCHAR(4000), — e.g. ‘”building*” AND “consent*” AND “form*”‘
@searchPhrase NVARCHAR(4000) — e.g. ‘building consent form’
)
AS
BEGIN
— Tmp table —
DECLARE @searchResults TABLE
(
QuestionID int,
Rank int,
ExactMatchToQuestionText bit default 0
)

— Question contains keywords in order —
INSERT @searchResults (QuestionID, ExactMatchToQuestionText)
SELECT QuestionID, 1
FROM Question
WHERE QuestionText LIKE ‘%’ + @searchPhrase + ‘%’

— Free Text Search matches —
INSERT @searchResults (QuestionID, Rank, ExactMatchToQuestionText)
SELECT [Key], [Rank], 0
FROM CONTAINSTABLE(dbo.Question, QuestionText, @keywords, 100)
AND [Key] not in (SELECT QuestionID from @searchResults)

— Return —
SELECT Question.*
FROM @searchResults [Results]
JOIN Question ON Results.QuestionID = Question.QuestionID
ORDER BY
Results.ExactMatchToQuestionText DESC,
Results.Rank DESC
END

Still wrong

This code still has a bug – a search for [what are building consents] returns 1 hit, but a search for [what building consents] returns 0, because “what” is a SQL Server stopword. D’oh!
At this point I am better off throwing away SQL Full Text Search and completely rolling my own query.
So why doesn’t SQL Server Full Text Search include a simple option for doing AND searches against the FREETEXTSEARCH table?
And, does anyone know a better way to do AND searches using SQL Server Full Text Search?

Advertisements

3 thoughts on “AND Search using SQL Server Full Text Search

  1. CONTAINSTABLE also seems to be a lot stricter with regard to phrase searches.
    For instance let’s say we have a question: “Does Microsoft produce an annual report?”.

    Using CONTAINSTEXT and the following search phrase: “Does AND Microsoft AND produce AND an AND annual AND report” gives us a match but excluding the “an” (“Does AND Microsoft AND produce AND annual AND report”) does not. It seems like CONTAINSTEXT is quite strict about the terms being ANDed and does not seem to see the “an” as a stopword. It seems to be seatching for questions with only the terms provided.

    FREETEXTTABLE seems to be slightly “fuzzier” and using “Does Microsoft produce an annual report” and “Does Microsoft produce annual report” as search phrases provides a match in each instance.

    Perhaps using a combination of CONTAINSTEXT and FREETEXTTABLE is the best option.

  2. Well, I also came across the same issue with fulltext search. In my project I have implemented a Phrase search and an AND search. In order to include the stop words in my search field, I have made the fulltext index with a {Full-Text Index Stoplist = } option, and that helped to inclide the stop words as well.

    Also I had another issue with the AND search, I have to search all full text columns ( In my case, Title, ShortDescription, Synopsis, Author from the Product table) to get the result and the containstable with a * option doesn’t helped me. Finally I came up with a solution to create a view with a single column ( Title + ShortDescription + Synopsis + Author), then do the containstable with this field.

    Still I am struggling with an issue, as my search returns duplicate rankings and the orderby is not as I expected. Any help?

  3. I got the same issue. I solved it by using a UDF where I dropped stop words from the string searched:

    CREATE FUNCTION [dbo].[fnDropStopWords](
    @sInputList NVARCHAR(4000)
    ) RETURNS NVARCHAR(4000)
    BEGIN
    DECLARE @sOutputList NVARCHAR(4000)
    DECLARE @List TABLE (item NVARCHAR(4000) COLLATE Modern_Spanish_CS_AS)

    SET @sOutputList = ”
    INSERT INTO @List SELECT * FROM dbo.SplitString(@sInputList,’ ‘)

    SELECT @sOutputList = @sOutputList + item + ‘ ‘ FROM @List
    WHERE item NOT IN ( select stopword COLLATE Modern_Spanish_CS_AS
    from sys.fulltext_stopwords
    where language_id=3082)

    RETURN @sOutputList
    END

    CREATE FUNCTION [dbo].[SplitString]
    (
    @List NVARCHAR(MAX),
    @Delim VARCHAR(255)
    )
    RETURNS TABLE
    AS
    RETURN ( SELECT [Value] FROM
    (
    SELECT
    [Value] = LTRIM(RTRIM(SUBSTRING(@List, [Number],
    CHARINDEX(@Delim, @List + @Delim, [Number]) – [Number])))
    FROM (SELECT Number = ROW_NUMBER() OVER (ORDER BY name)
    FROM sys.all_objects) AS x
    WHERE Number <= LEN(@List)
    AND SUBSTRING(@Delim + @List, [Number], LEN(@Delim)) = @Delim
    ) AS y
    );

    Regards.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s