Geeks With Blogs

Victor Cassel A Microsoft data programmer guy in Stockholm

In this post I will discuss some search skills and tricks that I have acquired during my years of developing in .Net against Microsoft SQL Server:

  • Dynamic queries
  • Regex-searching
  • Free-text searching

At my previous position, we needed to implement free-text search for more than 600,000 bug documents, and were happy to find out that SQL Server had full-text searching capabilities in the package. The free-text search looks very similar to the SharePoint standard search engine, historically at least, it seems the SharePoint delegated its searching tasks to SQL Server. The SQL Server full-text feature has built-in support for word breaking and inflections, thesaurus and noise word-lists for most languages. I will show how to setup the full-text search for preferred tables and columns, then give a usage example.

First a technology backgrounder: How does a search engine (aka text search, free-text or full-text search) differ from the structured SQL query approach used in a relational database? And when is one better than the other? Well, a search engine is much faster for searching large data sets, is simpler to use for the end user with its natural language capabilities, but is also more complex and difficult to setup on the backside, and it normally limits your result to the most relevant few hits. Relational SQL engines on the other hand, are the best choice for programming against already high-structured data, to get exact query results, and for computing and transactions.

Let’s begin with going through some search obstacles you might have stumbled upon in SQL Server.

Dynamic SQL queries and optional search parameters.

The most dynamic solution is to build the TSQL query strings in code and sending them to the database, but this opens up for bugs, security holes and maintenance problems. So try to stay away from basic string concatenation and hard coding - maybe the following is a substitute.
Suppose we want to search a log table for rows between @dateFrom to @dateTo, where parameters are optional, meaning an open-ended interval. Then you can try this TSQL solution :
   WHERE (@dateFrom is null OR @dateFrom <= logDate) AND (@dateTo is null OR @dateTo > logDate)
The above TSQL code relies on the fact that parameters can always be null and expressions are short-circuit evaluated. So, the right-hand OR-part is skipped when @dateFrom or @dateTo are null.

Another common scenario is that you interface the database from within .Net code through an ORM-layer like LINQ2Sql or LINQ2EF. This gives you static typing for table and column names and their relations, and you formulate your queries in LINQ instead of TSQL, and get full intellisense support. Now, since the LINQ->TSQL translation and query execution is deferred until first data fetch, you can incrementally construct semi-dynamic queries, see the C#-code below.

   var query = from item in …
   if (dateFrom != null)
      query = query.Where(item => item.logDate >= dateFrom);   
   if (dateTo != null)
      query = query.Where(item => item.logDate < dateTo);
   var firstHit = query.FirstOrDefault(); // on this line the TSQL is generated and sent to the database server for querying

Here is another super handy trick for debugging and intercepting the TSQL being sent to the server:
context.Database.Log = Console.Write; // in EF 6.0
context.Log = Console.Out;
// in linq2sql

Regex-searching

The closest thing to Regex in SQL Server are the keywords LIKE and PATINDEX, because they have pattern matching, for example:
   WHERE au_lname LIKE 'de[^P]%'
This fetches all the authors whose last names start with de and the following letter is not a P. There is also a possibility to extend the TSQL query language with your own .Net CLR assemblies, see example here: .Net RegEx in TSQL.

Full-text searching

LIKE-searches are very slow compared to full-text searches.
The free-text search feature is ready to use in a default SQL server installation, but probably unknown to most users and developers. You need to complete the following steps:
  1. Create a full text catalog at the database level. This is a logical container for the index so it is automatically managed in backups/restores.
    image
  2. Next add a full-text index for any table/view you wish to index. At least one unique index is required in the table/view, so use your primary key. At the last page in the wizard you should select the string-based columns that you want to include in your full-text index, and select the language the texts are normally written in.
    image
  3. Next you can right click and force a full population of the newly created index. But no sweat, otherwise the indexation process till run silently and incrementally in the background. A full indexation of a million documents will take a couple of minutes.
  4. You are ready to query with full-text support. Use the CONTAINS()- or FREETEXT()-functions in the WHERE-clause of a query or CONTAINSTABLE() or FREETEXTTABLE() in the FROM-clause of a query (using an inner join on the primary key). The difference is that the CONTAINS and FREETEXT are boolean functions (match/no match) whereas the table functions return a rowset with ranked results. In a search solution it is better to use the table-functions since you are interested in the best matches. In the calls, be sure to set the optional TopN-parameter so the server will do the work of filtering to the best results. (The two table functions actually just return two columns, the primary key as a string and the rank as a number from 1-1000.)
 

In short, the differences between the Contains- and the Freetext-functions are that the Contains- variant is more configurable and the Freetext variant is more Google-like and easier to use.

Here is an example that uses the CONTAINSTABLE to find the best 5 matches from the product table using a NEAR clause, taken from examples online.

Example of a Contains-query

SELECT FT_TBL.ProductDescriptionID,

   FT_TBL.Description,

   KEY_TBL.RANK

FROM Production.ProductDescription AS FT_TBL INNER JOIN

   CONTAINSTABLE (Production.ProductDescription,

      Description,

      '(light NEAR aluminum) OR(lightweight NEAR aluminum)',

      5

   ) AS KEY_TBL

       ON FT_TBL.ProductDescriptionID = KEY_TBL.[KEY]

Sort order

Especially when searching, the collation and accent sensitivity in the database are important, because these settings decide how strings are compared and how rows are sorted. Run this TSQL to see current settings:

SELECT * FROM sys.databases

Recognized languages

SELECT * FROM sys.fulltext_languages

Theasaurus

There is also a thesaurus-file per language, in XML format, that can be used to look up synonyms when querying, but it is empty per default. They are found in this path:

C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\FTData

Noise words

List the noise words (stop words) with:

SELECT * FROM sys.fulltext_system_stopwords WHERE language_id = 1033

Some English noise words are: about, after, all, also, an, and, another, any

To customize the noise word list go to Storage>Full Text Stoplists and right click New Full-Text Stoplist… and choose Create from the system stoplist, call it for example MyStopList. Then add a new noise word:

ALTER FULLTEXT STOPLIST MyStopList ADD 'microsoft' LANGUAGE ‘English'

If you want to see how words are classified as noise words or not, test this:

SELECT special_term, display_term

FROM sys.dm_fts_parser

(N' "Which of these are noise words?" ', 1033, 0, 1)

The raw index

Next, if you want to take a look at the raw inverted search index, it is not possible, it seems hidden away in some internal data row set on the Sql Server. However, the below TSQL gives a simple overview of occurences of words together with corresponding table primary key (called document_id) and the column name where they occur.

SELECT * FROM sys.dm_fts_index_keywords_by_document

(DB_ID('TESTDB'), OBJECT_ID('Article'))

Text searching of files stored in tables

And finally, an impressive full-text feature - you can index most files formats. Run below TSQL to see recognized document formats. The documents should be stored in a varbinary(max) column in a table. You also must identify the file contents by its file extension in a dedicated content description column.

Recognized document formats

This command lists the document formats understood by SQL Server, (besides shows word breaker DLL.s available).

EXEC sp_help_fulltext_system_components 'all';

Strangely, the MS Office Xml-based formats (docx, xmlx and pptx from version 2007) are not included. You need to install an additional Filter pack, like this:

Posted on Wednesday, April 30, 2014 5:24 AM | Back to top


Comments on this post: Three Easy Search Skills for SQL Server

No comments posted yet.
Your comment:
 (will show your gravatar)


Copyright © Victor Cassel | Powered by: GeeksWithBlogs.net