A brief discussion of the Chinese word segmentation technology of Baidu and Google

In example 11, the query runs together "Dachangjin", the title of a drama, and "every day", the title of a variety show. Baidu splits it correctly as "Dachangjin / every day", which suggests that its lexicon covers a wide range and contains a large number of words. Examples 7 and 8 show that Baidu does not ignore words such as "the" and "and" in Chinese queries, that is, words that carry little meaning on their own (stop words). Further trials give the same result: words such as "you" are not ignored either and are still kept as search terms after segmentation.

Chinese word segmentation is one of the core technologies of Baidu and Google, and neither company makes the details public. Paper Alliance can therefore only infer the behavior with a black-box method: enter a query, examine the results, and use the cached web page snapshots of the two search engines to see how the query was segmented for retrieval.

Table 1

Example 1 shows that when a query is placed in quotation marks for exact retrieval, Baidu does not segment it. Example 2 shows that when the input consists of several substrings separated by spaces, Baidu does not segment those substrings further (if they are separated by punctuation instead, automatic segmentation is applied). Examples 3, 6, 7, 8 and 10 show that Baidu applies several segmentation schemes to a query. It first tries the unsegmented query as a direct match, and if that returns results they are placed at the front of the search results. It then matches according to the different segmented forms.
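The handling order described above can be sketched as a small function. This is only an illustration of the observed behavior, not the search engine's actual code: the function name `candidate_queries` and the `segment` callback are hypothetical, and the space case is read here as "split on spaces, no further segmentation", which is one interpretation of example 2.

```python
def candidate_queries(raw, segment):
    """Return the term lists to try, in priority order (illustrative only).

    `segment` is any function mapping a string to a list of words; it is a
    stand-in for the engine's segmenter, which is not public.
    """
    raw = raw.strip()
    # Quoted query: exact retrieval, no segmentation at all.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return [[raw[1:-1]]]
    # Space-separated substrings: assumed to be kept as typed.
    if " " in raw:
        return [raw.split()]
    # Otherwise try the unsegmented string first, then the segmented form.
    candidates = [[raw]]
    segmented = segment(raw)
    if segmented != [raw]:
        candidates.append(segmented)
    return candidates
```

Results for earlier candidates in the returned list would be ranked ahead of results for later ones.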

The next question is whether Baidu uses forward maximum matching or reverse maximum matching. In example 7, "walk and temperament" is split as "walk out / / and temperament", a result that looks like reverse maximum matching. In example 12, "standard of living", reverse maximum matching should give "/ / residents living standard", but the segmentation Baidu actually applies to the query is different and in this case looks like forward segmentation. Baidu's segmentation is therefore not simply forward maximum matching or reverse maximum matching; it most likely uses bidirectional maximum matching.
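For readers unfamiliar with the three methods named above, here is a minimal sketch. The lexicon and the sample string are illustrative and do not come from the article's table, and the tie-breaking rule (fewer words, then fewer single-character words, then the reverse result) is a common textbook heuristic assumed here, not a confirmed detail of any search engine.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, taking the longest lexicon word at each step."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always pass
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Scan right to left, taking the longest lexicon word at each step."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.insert(0, piece)
                j -= size
                break
    return words


def bidirectional_max_match(text, lexicon, max_len=4):
    """Run both scans and keep the result that looks better (assumed rule)."""
    fwd = forward_max_match(text, lexicon, max_len)
    rev = reverse_max_match(text, lexicon, max_len)
    if len(fwd) != len(rev):
        return fwd if len(fwd) < len(rev) else rev

    def singles(ws):
        return sum(1 for w in ws if len(w) == 1)

    return fwd if singles(fwd) < singles(rev) else rev


lexicon = {"研究", "研究生", "生命", "起源"}  # toy lexicon for illustration
print(forward_max_match("研究生命起源", lexicon))        # ['研究生', '命', '起源']
print(reverse_max_match("研究生命起源", lexicon))        # ['研究', '生命', '起源']
print(bidirectional_max_match("研究生命起源", lexicon))  # ['研究', '生命', '起源']
```

The two scans disagree on this string, which is exactly the kind of difference the black-box comparison above relies on.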

Example 4, the search for "Bill Gates", shows that Baidu has a dedicated lexicon of proper names. Example 5, the search for "xuriyanggang", shows that Baidu adds new words to its lexicon fairly quickly. In example 9, the query rendered here as "the word Yang just" yields a segmentation beginning "xuriyanggang /", which also shows that Baidu is able to identify new words.


Segmentation of search queries in Baidu and Google *

Baidu

First, deriving the Chinese word segmentation method

From the many examples above, grouped by type (example 8, for instance), we can see that Baidu does not segment the query straight away. Taking "Zhu De's mother" as an example, it first identifies proper names or special terms, then segments the remaining part according to its segmentation method, applying the principle of the fewest segments together with a three-element cross segmentation method.
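The two-stage idea (proper names first, then the fewest possible segments for the remainder) can be sketched as follows. This is an illustration under stated assumptions, not the search engine's implementation: the function names, the name list, the lexicon and the Chinese string used for the "Zhu De's mother" query are all hypothetical, and the three-element cross segmentation step is not modelled.

```python
def fewest_words(text, lexicon):
    """Split text into the fewest pieces; single characters are always allowed."""
    n = len(text)
    best = [0] + [n + 1] * n  # best[i] = fewest pieces covering text[:i]
    back = [0] * (n + 1)      # back[i] = start index of the last piece
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if (i - j == 1 or piece in lexicon) and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    words, i = [], n
    while i > 0:
        words.insert(0, text[back[i]:i])
        i = back[i]
    return words


def segment_names_first(text, names, lexicon):
    """Cut out known proper names first, then segment what is left."""
    for name in sorted(names, key=len, reverse=True):  # longest names first
        pos = text.find(name)
        if pos != -1:
            left = segment_names_first(text[:pos], names, lexicon)
            right = segment_names_first(text[pos + len(name):], names, lexicon)
            return left + [name] + right
    return fewest_words(text, lexicon) if text else []


names = {"朱德"}    # assumed proper-name list
lexicon = {"母亲"}  # assumed general lexicon
print(segment_names_first("朱德的母亲", names, lexicon))  # ['朱德', '的', '母亲']
```

Pulling names out before general segmentation keeps a name from being broken up by a general-lexicon match that happens to overlap it.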

