PostgreSQL: Documentation: 15: 8.11. Text Search Types


8.11. Text Search Types

8.11.1. tsvector
8.11.2. tsquery

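PostgreSQL provides two data types designed to support full text search, which is the activity of searching through a collection of natural-language documents to locate those that best match a query. The tsvector type represents a document in a form optimized for text search; the tsquery type represents a text query. Chapter 12 explains this facility in detail, and Section 9.13 summarizes the related functions and operators.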

8.11.1. `tsvector`

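A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details). Sorting and duplicate elimination are done automatically during input, as this example shows: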

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'

To represent lexemes containing whitespace or punctuation, surround them with quotes:

SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector
-------------------------------------------
 '    ' 'contains' 'lexeme' 'spaces' 'the'

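(This example and the next one use dollar-quoted string literals to avoid the confusion of having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled: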

SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector
------------------------------------------------
 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'

Optionally, integer positions can be attached to lexemes:

SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4

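A position normally indicates the source word's location in the document. Positional information can be used for proximity ranking. Position values can range from 1 to 16383; larger numbers are silently set to 16383. Duplicate positions for the same lexeme are discarded.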
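As a small illustration of the clamping and de-duplication rules, the following sketch casts a hand-written literal that repeats one position and uses one out-of-range position. The input string is made up for this example.

```sql
-- Position 3 appears twice and 20000 exceeds the maximum of 16383.
-- Per the rules above, the duplicate is discarded and 20000 is
-- silently set to 16383.
SELECT 'cat:3 cat:3 cat:20000'::tsvector;
-- Expected result: 'cat':3,16383
```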

Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default and hence is not shown on output:

SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector
----------------------------
 'a':1A 'cat':5 'fat':2B,4C

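Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers.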
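One way this might look in practice is sketched below, using the setweight and ts_rank functions covered in Section 9.13 and Chapter 12. The title and body strings are hypothetical, and the second weights array is an arbitrary choice made only to show that the ranking responds to it. The array is given in the order {D, C, B, A}.

```sql
-- Hypothetical document: title words get weight A, body words weight D.
WITH doc AS (
    SELECT setweight(to_tsvector('english', 'Fat cats'), 'A') ||
           setweight(to_tsvector('english', 'A fat cat sat on a mat'), 'D') AS tsv
)
SELECT tsv,
       -- Default weights: {0.1, 0.2, 0.4, 1.0} for D, C, B, A.
       ts_rank(tsv, to_tsquery('english', 'cat')) AS rank_default,
       -- Illustrative alternative that favors D over A instead.
       ts_rank('{1.0, 0.4, 0.2, 0.1}'::float4[], tsv,
               to_tsquery('english', 'cat')) AS rank_reversed
FROM doc;
```

The two rank columns should differ, since the same lexeme occurs once with weight A and once with weight D.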

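It is important to understand that the tsvector type itself does not perform any word normalization; it assumes the words it is given are already normalized appropriately for the application. For example: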

SELECT 'The Fat Rats'::tsvector;
      tsvector
--------------------
 'Fat' 'Rats' 'The'

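For most English-text-searching applications the above words would be considered non-normalized, but tsvector doesn't care. Raw document text should usually be passed through to_tsvector to normalize the words appropriately for searching: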

SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3

Again, see Chapter 12 for more detail.

8.11.2. `tsquery`

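A tsquery value stores lexemes that are to be searched for, and can combine them using the Boolean operators & (AND), | (OR), and ! (NOT), as well as the phrase search operator <-> (FOLLOWED BY). There is also a variant <N> of the FOLLOWED BY operator, where N is an integer constant that specifies the distance between the two lexemes being searched for. <-> is equivalent to <1>.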

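Parentheses can be used to enforce grouping of these operators. In the absence of parentheses, ! (NOT) binds most tightly, <-> (FOLLOWED BY) next most tightly, then & (AND), with | (OR) binding the least tightly.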

Here are some examples:

SELECT 'fat & rat'::tsquery;
    tsquery
---------------
 'fat' & 'rat'

SELECT 'fat & (rat | cat)'::tsquery;
          tsquery
---------------------------
 'fat' & ( 'rat' | 'cat' )

SELECT 'fat & rat & ! cat'::tsquery;
        tsquery
------------------------
 'fat' & 'rat' & !'cat'
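
As a further illustration, the phrase operators and the default precedence can be checked against hand-written tsvector values with the @@ match operator. The vectors below are made up for this sketch; the results noted in the comments follow from the distance and precedence rules stated above.

```sql
-- <-> requires adjacent positions; <2> requires a distance of exactly two.
SELECT 'fat:1 cat:2'::tsvector @@ 'fat <-> cat'::tsquery;   -- t
SELECT 'fat:1 cat:3'::tsvector @@ 'fat <-> cat'::tsquery;   -- f
SELECT 'fat:1 cat:3'::tsvector @@ 'fat <2> cat'::tsquery;   -- t

-- & binds more tightly than |, so the first query is (fat & cat) | rat.
SELECT 'rat'::tsvector @@ 'fat & cat | rat'::tsquery;       -- t
SELECT 'rat'::tsvector @@ 'fat & (cat | rat)'::tsquery;     -- f
```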

Optionally, lexemes in a tsquery can be labeled with one or more weight letters, which restricts them to match only tsvector lexemes with one of those weights:

SELECT 'fat:ab & cat'::tsquery;
    tsquery
------------------
 'fat':AB & 'cat'
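
A minimal sketch of how the weight restriction affects matching, using hand-written vectors that are assumptions of this example:

```sql
-- 'fat' carries weight A in the vector, which is one of the allowed weights.
SELECT 'fat:1A cat:2'::tsvector @@ 'fat:ab & cat'::tsquery;  -- t

-- Here 'fat' carries weight C, so the restricted lexeme does not match.
SELECT 'fat:1C cat:2'::tsvector @@ 'fat:ab & cat'::tsquery;  -- f
```

The unlabeled lexeme cat in the query matches regardless of weight.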

Also, lexemes in a tsquery can be labeled with * to specify prefix matching:

SELECT 'super:*'::tsquery;
  tsquery
-----------
 'super':*

This query will match any lexeme in a tsvector that begins with "super".
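
For instance, with two made-up vectors (no normalization is applied, since the values are cast directly):

```sql
-- 'supernova' begins with 'super', so the prefix query matches.
SELECT 'supernova star'::tsvector @@ 'super:*'::tsquery;  -- t

-- No lexeme here begins with 'super'.
SELECT 'nova star'::tsvector @@ 'super:*'::tsquery;       -- f
```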

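Quoting rules for lexemes are the same as described previously for lexemes in tsvector; and, as with tsvector, any required normalization of words must be done before converting to the tsquery type. The to_tsquery function is convenient for performing such normalization: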

SELECT to_tsquery('Fat:ab & Cats');
    to_tsquery
------------------
 'fat':AB & 'cat'

Note that to_tsquery processes prefixes in the same way as other words, which means this comparison returns true:

SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' );
 ?column?
----------
 t

because postgres gets stemmed to postgr:

SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' );
  to_tsvector  | to_tsquery
---------------+------------
 'postgradu':1 | 'postgr':*

and postgr is a prefix of postgradu, the stemmed form of postgraduate, so the query matches.
