Go to the first, previous, next, last section, table of contents.


24 MySQL internals

This chapter describes a lot of things that you need to know when working on the MySQL code.

24.1 MySQL threads

The MySQL server creates the following threads:

mysqladmin processlist only shows the connection and INSERT DELAYED threads.

24.2 MySQL full-text search

Since Version 3.23.23, MySQL has support for full-text indexing and searching. Full-text index in MySQL is an index of type FULLTEXT. FULLTEXT indexes can be created from VARCHAR and TEXT columns at CREATE TABLE time or added later with ALTER TABLE or CREATE INDEX. Full-text search is performed with the MATCH function.

mysql> CREATE TABLE t (a VARCHAR(200), b TEXT, FULLTEXT (a,b));
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO t VALUES
    ->   ('MySQL has now support', 'for full-text search'),
    ->   ('Full-text indexes', 'are called collections'),
    ->   ('Only MyISAM tables','support collections'),
    ->   ('Function MATCH ... AGAINST()','is used to do a search'),
    ->   ('Full-text search in MySQL', 'implements vector space model');
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM t WHERE MATCH (a,b) AGAINST ('MySQL');
+---------------------------+-------------------------------+
| a                         | b                             |
+---------------------------+-------------------------------+
| MySQL has now support     | for full-text search          |
| Full-text search in MySQL | implements vector-space-model |
+---------------------------+-------------------------------+
2 rows in set (0.00 sec)

mysql> SELECT *,MATCH a,b AGAINST ('collections support') as x FROM t;
+------------------------------+-------------------------------+--------+
| a                            | b                             | x      |
+------------------------------+-------------------------------+--------+
| MySQL has now support        | for full-text search          | 0.3834 |
| Full-text indexes            | are called collections        | 0.3834 |
| Only MyISAM tables           | support collections           | 0.7668 |
| Function MATCH ... AGAINST() | is used to do a search        |      0 |
| Full-text search in MySQL    | implements vector space model |      0 |
+------------------------------+-------------------------------+--------+
5 rows in set (0.00 sec)

The function MATCH matches a natural language query AGAINST a text collection (which is simply the columns that are covered by a FULLTEXT index). For every row in a table it returns relevance - similarity measure between the text in that row (in the columns that are part of the collection) and the query. When it is used in a WHERE clause (see example above) the rows returned are automatically sorted with relevance decreasing. Relevance is a non- negative floating-point number. Zero relevance means no similarity. Relevance is computed based on number of words in the row and number of unique words in that row, total number of words in the collection, number of documents (rows), that contain a particular word, etc.

MySQL uses a very simple parser to split text into words. A "word" is any sequence of letters, numbers, ', and _. Any "word" that is present in the stopword list or just too short (3 characters or less) is ignored.

Every correct word in the collection and in the query is weighted, according to its significance in the query or collection. This way, a word that is present in many documents will have lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Otherwise, if the word is rare, it will receive a higher weight. Weights of the words are then combined to compute the relevance.

Such a technique works best with big collections (in fact, it was carefully tuned up this way). For very small tables, word distribution does not reflect adequately their semantical value, and this model may sometimes produce bizarre results.

For example search for the word "search" will produce no results in the above example. Word "search" is present in more than half of rows, and as, such, is effectively treated as stopword (i.e. with semantical value zero). It is, really, the desired behaviour - natural language query should not return every second row in 1GB table.

The word that select 50% of rows has low ability to locate relevant documents (and will find plenty of unrelevant documents also - we all know this happen too often when we are trying to find something in Internet with search engine), and, as such, has low semantical value in this particular dataset.


Go to the first, previous, next, last section, table of contents.