INTRO

The ngram parser is one of the parsers MySQL Full-Text Search uses to split text into searchable tokens.

I used it in an earlier project, but I was so focused on getting things finished that I never sat down to read the documentation properly. This post goes through it carefully so that I actually understand it.

Ngram

Tokenize

The ngram parser tokenizes a string into contiguous sequences of n characters.

string = "abcd"

n=1: 'a','b','c','d'
n=2: 'ab','bc','cd'
n=3: 'abc','bcd'
n=4: 'abcd'
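
The splits above can be reproduced with a sliding window. This is only a minimal Python sketch of the idea, not MySQL's actual implementation; the function name `ngrams` is made up for illustration.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the string, one character at a time.
    # If the string is shorter than n, the range is empty and no tokens are produced.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


for n in range(1, 5):
    print(n, ngrams("abcd", n))
```

A string of length m yields m - n + 1 tokens, which is why "abcd" gives four unigrams but only one 4-gram.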

What about Korean?

# if token size = 2
string = "빵은 커피랑 먹으면 맛있다"

["빵은","커피","피랑","먹으","으면","맛있","있다"]

Whitespace is dropped during tokenization: each space-separated chunk is tokenized on its own, so no token ever spans a space.
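
The Korean example can be checked with a small script. The sketch below assumes that whitespace simply separates chunks and that each chunk is windowed independently; `ngram_tokenize` is a hypothetical name, and the real parser applies further rules that are not modeled here.

```python
def ngram_tokenize(text: str, n: int = 2) -> list[str]:
    """Split on whitespace, then take n-character windows inside each chunk."""
    tokens: list[str] = []
    for chunk in text.split():  # str.split() with no argument discards all whitespace
        # Chunks shorter than n produce no tokens at all.
        tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


print(ngram_tokenize("빵은 커피랑 먹으면 맛있다"))
```

With n = 2 this produces the seven tokens listed above. Under the same assumption, a one-character chunk contributes nothing, so "a bc" would yield only "bc".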

Token Size

The default token size of the ngram parser is 2 (bigram).