public abstract class ASegment extends Object implements ISegment
| Modifier and Type | Field and Description |
|---|---|
| protected JcsegTaskConfig | config |
| protected int | ctrlMask: segmentation runtime function control mask |
| protected ADictionary | dic: the dictionary and task configuration instance |
| protected IntArrayList | ialist |
| protected int | idx: the index value of the current input stream, mainly used to track the start position of the token |
| protected IStringBuffer | isb |
| protected IPushbackReader | reader |
| protected LinkedList<IWord> | wordPool: CJK word cache pool, reusable string buffer, and the array list for basic integers |

Fields inherited from interface ISegment: CHECK_CE_MASk, CHECK_CF_MASK, START_SS_MASK

| Constructor and Description |
|---|
| ASegment(JcsegTaskConfig config, ADictionary dic) |
| ASegment(Reader input, JcsegTaskConfig config, ADictionary dic): initialize the segment |

| Modifier and Type | Method and Description |
|---|---|
| protected void | appendLatinSyn(IWord w): Check and append the synonyms of the specified word, including CJK and basic Latin words. All synonyms share the same position, part of speech, and word type as the primitive word. |
| protected void | appendWordFeatures(IWord word): Check and append the pinyin and the synonyms of the specified word. |
| protected IWord | enSecondSeg(IWord w, boolean retfw): Do the secondary split for the specified complex Latin word. This splits a word composed of English letters, Arabic numerals, and punctuation into multiple simple parts; for example, 'qq2013' is split into 'qq' and '2013'. |
| protected String | findCHName(char[] chars, int index, IChunk chunk): Find a Chinese name from the current position of the input chars. |
| boolean | findCHName(IWord w, IChunk chunk): Deprecated. |
| protected abstract IChunk | getBestCJKChunk(char[] chars, int index): An abstract method to get a CJK word from the current position. |
| JcsegTaskConfig | getConfig(): Get the current task configuration instance. |
| ADictionary | getDict(): Get the current dictionary instance. |
| protected IWord | getNextCJKWord(int c, int pos): Get the next CJK word from the current position of the input stream. |
| protected IWord | getNextLatinWord(int c, int pos): Get the next Latin word from the current position of the input stream. |
| protected IWord[] | getNextMatch(char[] chars, int index): Match the next CJK word in the dictionary. |
| protected IWord | getNextPunctuationPairWord(int c, int pos): Get the next punctuation pair word from the current position of the input stream. |
| protected String | getPairPunctuationText(int c): Find the pair punctuation of the given punctuation char; the purpose is to get the text between them. |
| int | getStreamPosition(): Get the current length of the stream. |
| IWord | next(): Segment a word from a char array, starting from a specified position. |
| protected IWord | nextBasicLatin(int c): Find the letter or digit word from the current position, counting until the char is whitespace or not a letter or digit. |
| protected char[] | nextCJKSentence(int c): Load a CJK char list from the stream, starting from the current position and continuing until the char is not a CJK char. |
| protected String | nextCNNumeric(char[] chars, int index): Find the Chinese number from the current position, counting until the char in the specified position is not an other number or whitespace. |
| protected String | nextLetterNumber(int c): Find the letter number from the current position, counting until the char in the specified position is not a letter number or whitespace. |
| protected String | nextOtherNumber(int c): Find the other number from the current position, counting until the char in the specified position is not an other number or whitespace. |
| protected void | pushBack(int data): Push back the data to the stream. |
| protected int | readNext(): Read the next char from the current position. |
| void | reset(Reader input): Reset the input stream and reader. |
| void | setConfig(JcsegTaskConfig config): Set the current task configuration instance. |
| void | setDict(ADictionary dic): Set the dictionary of the current tokenizer. |

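
The secondary split described for enSecondSeg ('qq2013' becomes 'qq' and '2013') can be illustrated with a small standalone sketch. This is not the Jcseg implementation: the class `SecondarySplitSketch`, the `Kind` enum, and the rule "start a new part whenever the character class changes" are assumptions made for illustration, and the sketch works on plain strings rather than IWord instances.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Jcseg: splits a mixed Latin token by character class. */
class SecondarySplitSketch {

    /** Character classes used to decide where one part ends and the next begins. */
    enum Kind { LETTER, DIGIT, OTHER }

    static Kind kindOf(char ch) {
        if (Character.isDigit(ch)) return Kind.DIGIT;
        if (Character.isLetter(ch)) return Kind.LETTER;
        return Kind.OTHER; // punctuation and anything else
    }

    /** Returns the maximal runs of same-kind characters, in input order. */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= token.length(); i++) {
            // Close the current run at the end of input or when the character class changes.
            if (i == token.length() || kindOf(token.charAt(i)) != kindOf(token.charAt(start))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("qq2013")); // prints [qq, 2013]
    }
}
```

In the real class the resulting parts would presumably be wrapped as IWord objects that share the type and part of speech of the original word; the sketch leaves that out.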
protected int idx
protected IPushbackReader reader
protected LinkedList<IWord> wordPool
protected IStringBuffer isb
protected IntArrayList ialist
protected int ctrlMask
protected ADictionary dic
protected JcsegTaskConfig config
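
The ctrlMask field is described as a runtime function control mask, and ISegment supplies the constants CHECK_CE_MASk, CHECK_CF_MASK, and START_SS_MASK. A minimal sketch of how such a bit mask is typically switched and tested follows. The bit values below are assumptions (the real values are defined in ISegment and are not given on this page), and `CtrlMaskSketch` with its helper methods is hypothetical.

```java
/** Hypothetical illustration of a bit-flag control mask; not part of Jcseg. */
class CtrlMaskSketch {
    // Assumed values: each flag occupies its own bit. The real constants live in ISegment.
    static final int CHECK_CE_MASk = 1 << 0;
    static final int CHECK_CF_MASK = 1 << 1;
    static final int START_SS_MASK = 1 << 2;

    /** Turns a flag on without touching the other bits. */
    static int enable(int mask, int flag) {
        return mask | flag;
    }

    /** Turns a flag off without touching the other bits. */
    static int disable(int mask, int flag) {
        return mask & ~flag;
    }

    /** Reports whether a flag is currently on. */
    static boolean isSet(int mask, int flag) {
        return (mask & flag) != 0;
    }

    public static void main(String[] args) {
        int ctrlMask = 0;
        ctrlMask = enable(ctrlMask, CHECK_CE_MASk);
        ctrlMask = enable(ctrlMask, START_SS_MASK);
        System.out.println(isSet(ctrlMask, CHECK_CE_MASk)); // prints true
        System.out.println(isSet(ctrlMask, CHECK_CF_MASK)); // prints false
        ctrlMask = disable(ctrlMask, CHECK_CE_MASk);
        System.out.println(isSet(ctrlMask, CHECK_CE_MASk)); // prints false
    }
}
```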
public ASegment(Reader input, JcsegTaskConfig config, ADictionary dic) throws IOException
Parameters: input; config - Jcseg task configuration instance; dic - Jcseg dictionary instance
Throws: IOException

public ASegment(JcsegTaskConfig config, ADictionary dic) throws IOException
Throws: IOException
See Also: ASegment(Reader, JcsegTaskConfig, ADictionary)

public void reset(Reader input) throws IOException
Specified by: reset in interface ISegment
Parameters: input
Throws: IOException

protected int readNext() throws IOException
Throws: IOException

protected void pushBack(int data) throws IOException
Parameters: data
Throws: IOException

public int getStreamPosition()
Specified by: getStreamPosition in interface ISegment

public void setDict(ADictionary dic)
Parameters: dic

public ADictionary getDict()

public void setConfig(JcsegTaskConfig config)
Parameters: config

public JcsegTaskConfig getConfig()

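
The interplay of readNext(), pushBack(int), and getStreamPosition() with the idx field can be sketched with the standard `java.io.PushbackReader`. This is an illustration under stated assumptions, not the Jcseg code: Jcseg uses its own IPushbackReader, and the class `PositionTrackingReaderSketch`, the starting value of `idx`, and the `idx + 1` position rule are all hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a reader that tracks its position across reads and push-backs. */
class PositionTrackingReaderSketch {
    private final PushbackReader reader;
    private int idx = -1; // assumption: index of the last char handed out, -1 before any read

    PositionTrackingReaderSketch(Reader input) {
        this.reader = new PushbackReader(input, 8); // buffer for up to 8 pushed-back chars
    }

    /** Reads one char and advances the index; returns -1 at end of stream. */
    int readNext() throws IOException {
        int c = reader.read();
        if (c != -1) idx++;
        return c;
    }

    /** Returns a char to the stream and moves the index back by one. */
    void pushBack(int data) throws IOException {
        reader.unread(data);
        idx--;
    }

    /** Number of chars consumed so far (assumed to be idx + 1). */
    int getStreamPosition() {
        return idx + 1;
    }

    /** Reads "ab", pushes 'b' back, reads it again, and records the position after each step. */
    static String demo() {
        try {
            PositionTrackingReaderSketch r =
                    new PositionTrackingReaderSketch(new StringReader("ab"));
            StringBuilder trace = new StringBuilder();
            int a = r.readNext();
            trace.append((char) a).append(r.getStreamPosition()).append(' ');
            int b = r.readNext();
            trace.append((char) b).append(r.getStreamPosition()).append(' ');
            r.pushBack(b);
            trace.append("back").append(r.getStreamPosition()).append(' ');
            trace.append((char) r.readNext()).append(r.getStreamPosition());
            return trace.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints a1 b2 back1 b2
    }
}
```

Pushing a char back lets a tokenizer peek one char ahead to decide where a token ends, then hand that char to the next token without losing track of the token's start position.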
public IWord next() throws IOException
Specified by: next in interface ISegment
Throws: IOException
See Also: ISegment.next()

protected IWord getNextCJKWord(int c, int pos) throws IOException
Parameters: c; pos
Throws: IOException

protected IWord getNextLatinWord(int c, int pos) throws IOException
Parameters: c; pos
Throws: IOException

protected IWord getNextPunctuationPairWord(int c, int pos) throws IOException
Parameters: c; pos
Throws: IOException

protected void appendWordFeatures(IWord word)
Parameters: word

protected void appendLatinSyn(IWord w)
Parameters: w

protected IWord enSecondSeg(IWord w, boolean retfw)
Do the secondary split for the specified complex Latin word. This splits a word composed of English letters, Arabic numerals, and punctuation into multiple simple parts; for example, 'qq2013' is split into 'qq' and '2013'. All the sub-words share the same type and part of speech as the primitive word. Check config.EN_SECOND_SEG before invoking this method.
Parameters: w; retfw - whether to return the fword

protected IWord[] getNextMatch(char[] chars, int index)
Parameters: chars; index

protected String findCHName(char[] chars, int index, IChunk chunk)
Parameters: chars; index; chunk

@Deprecated public boolean findCHName(IWord w, IChunk chunk)
Parameters: chunk - the best chunk

protected char[] nextCJKSentence(int c) throws IOException
Parameters: c
Throws: IOException

protected IWord nextBasicLatin(int c) throws IOException
Parameters: c
Throws: IOException

protected String nextLetterNumber(int c) throws IOException
Parameters: c
Throws: IOException

protected String nextOtherNumber(int c) throws IOException
Parameters: c
Throws: IOException

protected String nextCNNumeric(char[] chars, int index) throws IOException
Parameters: chars - char array of CJK items; index
Throws: IOException

protected String getPairPunctuationText(int c) throws IOException
Parameters: c
Throws: IOException

protected abstract IChunk getBestCJKChunk(char[] chars, int index) throws IOException
Parameters: chars; index
Throws: IOException

Copyright © 2016. All Rights Reserved.