
10-05-07

 Belgacom Skynet launches Skynet Exist

Skynet Exist (French part | Dutch part) is a social networking site aimed primarily at Belgians (most interactions are with people who live and work nearby) who want to gather in one place not only the digital content they create or are interested in, but also the digital information they have compiled over time.

 

Its raison d’être is twofold.

 

The first goal was to build a Web 2.0-oriented application supporting the latest technologies.

Therefore, every effort was made to implement the most commonly used techniques:

  • Aggregation of data in RSS/Atom
  • Extensive use of folksonomies in the form of tags or tag clouds
  • Clean and meaningful URLs
  • Rich Internet application (Ajax)

The second goal was to join the social networking trend, which is now seen as a strategic component.

 

The main features of the site are searching by tags to find interesting profiles and aggregating content to share your interests.

 

by Wang Olivier

16:01 Posted by Barbieri Sébastien in General

24-04-07

Thesaurus multilingual: third part

And now, how to use it in a search?

The context (the term inside []) cannot be used in a search. For example:

=>  fuels                         <-> fuels
          NT fuels [motors] <->    NT carburant
          NT fuels [heating] <->    NT combustible

 

Suppose a document contains: “… fuels …”
When a French user searches for “carburant”, it will be translated to:

“Carburant” <or> 0.70 “fuels [motors]”.

This document will not be retrieved.

We have to search for:

“Carburant” <or> 0.70 “fuels”.

So “[motors]” cannot simply be appended to a keyword and become part of it; it must be stored as a ‘context’. The context is displayed in the thesaurus administration interface, and in a drop-down list if you implement one in your search form, but it is not used in a search.
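A minimal sketch of that separation, in Python. The helper name `split_label` is hypothetical and not part of any product mentioned here; it only assumes that a label is written as a keyword optionally followed by a context in square brackets.

```python
def split_label(label):
    """Split a thesaurus label into (keyword, context).

    Hypothetical helper: 'fuels [motors]' -> ('fuels', 'motors').
    The keyword is what gets searched; the context is display-only.
    """
    keyword, bracket, rest = label.partition("[")
    context = rest.strip().rstrip("]").strip() if bracket else None
    return keyword.strip(), context


# The translation of "carburant" is stored as "fuels [motors]",
# but only the keyword part goes into the search expression.
keyword, context = split_label("fuels [motors]")
search_expression = '"carburant" <or> 0.70 "%s"' % keyword
```

With this split, the administration interface can still show the full label while the search form only ever sends the keyword.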

Search

In the monolingual thesaurus, the relationships get the following default weights:

Synonym (USE) = 80%
Narrower term (NT) = 50%
Related term (RT) = 10%

For a language relationship, we will use:

Language = 70%

A language relationship is represented by the ISO language code, e.g. FR, NL, DE, EN, …

An expanded search will find documents with translated keywords, for example:

Exact equivalence:

‘Bike’ –> {NL}’fiets’

 

A document contains: ‘Een rode fiets’.
When the user searches for ‘Bike’, the search will be on: ‘bike’ <or> 70% ‘fiets’.

Inexact equivalence

See exact equivalence.

Partial equivalence

See exact equivalence.

Single to multiple equivalence

Fuels <-> carburant + combustible

=>  ‘fuels’                         <-> ‘fuels’
           NT ‘fuels’ [motors] <->     NT ‘carburant’
           NT ‘fuels’ [heating] <->     NT ‘combustible’

 

A document contains: ‘le charbon est un combustible fossile’.
When the user searches for ‘fuels’ (in English), the search will be:

 

‘Fuels’ <or> 70 % ‘fuels’
            <or> 50 % (‘fuels’ <or> 70 % ‘carburant’)
            <or> 50 % (‘fuels’ <or> 70 % ‘combustible’)

 

The document will be found with a weight of 50% x 70% = 35%
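The expansion above can be sketched in Python as follows. This is only an illustration under assumptions: the `Concept` class and the `expand` function are hypothetical names, labels are stored with their context already stripped, and applying the narrower-term weight again at each deeper level is an assumption that the post does not state.

```python
from dataclasses import dataclass, field

LANGUAGE_WEIGHT = 0.70  # weight of a language relationship (from the post)
NT_WEIGHT = 0.50        # default weight of a narrower term (from the post)


@dataclass
class Concept:
    labels: dict                                  # language code -> searchable keyword
    narrower: list = field(default_factory=list)  # narrower Concept objects


def expand(concept, language, weight=1.0, result=None):
    """Return {keyword: weight}, keeping the best weight found for each keyword."""
    if result is None:
        result = {}
    for lang, keyword in concept.labels.items():
        w = weight if lang == language else weight * LANGUAGE_WEIGHT
        result[keyword] = max(result.get(keyword, 0.0), w)
    for child in concept.narrower:
        # Assumption: the NT weight is multiplied in at every level.
        expand(child, language, weight * NT_WEIGHT, result)
    return result


fuels = Concept(
    labels={"EN": "fuels", "FR": "fuels"},  # 'fuels' is loaned to French
    narrower=[
        Concept(labels={"EN": "fuels", "FR": "carburant"}),    # fuels [motors]
        Concept(labels={"EN": "fuels", "FR": "combustible"}),  # fuels [heating]
    ],
)
weights = expand(fuels, "EN")
```

Here ‘combustible’ ends up with 50% × 70%, which matches the weight computed above for the French document.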

Different meaning

Sometimes one term has several meanings in another language, for instance:
The French word ‘TOUR’ means towers, lathes, tours and rotation. This is solved by duplicating TOUR as follows:

TOUR [bâtiment] <-> towers
TOUR [outil]  <-> lathes
TOUR [voyage]  <-> tours
TOUR [rotation] <-> rotation

Conclusion

Implementing a multilingual thesaurus is a relatively light process.

15:53 Posted by Melkebeke Thierry in Technical | Tags: thesaurus, search

Thesaurus multilingual: second part

The issue

In a country such as Belgium, with 2 or 3 languages, a new issue arises.
When someone searches in French for “Belgique”, they may also be interested in documents containing “België” in Dutch or “Belgium” in English.
As explained in the previous post, a thesaurus is a tree of related words. We will suppose there are 2 languages and that both trees are identical. This makes it possible to create a new kind of relationship.
The difficulty is then to make the trees identical. Here are a few tips to reach that goal.

Some definitions

  • Dominant language: language in which every term exists
  • Secondary language: language in which some terms are missing
  • Loan term: term of the dominant language used within a secondary language to fill a gap

Introduction

This study is based on the ‘equal status’ approach. When 2 languages have the same status, they have exactly the same tree structure. Only the synonyms (also called non-preferred terms) can be missing.

In practice, there is only one tree; each keyword has a label for each language, and only synonyms may be absent in some languages.
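As a rough illustration, such a single tree could be modelled in Python like this. The `Keyword` class, the `missing_labels` check and the ‘véhicules’ sample node are hypothetical and only serve the example; the VTT / mountain bike pair is the one used in this post.

```python
from dataclasses import dataclass, field

LANGUAGES = ("FR", "EN")  # assumed language set for this illustration


@dataclass
class Keyword:
    preferred: dict                               # language -> preferred term (required)
    synonyms: dict = field(default_factory=dict)  # language -> non-preferred terms (optional)
    narrower: list = field(default_factory=list)  # narrower Keyword objects


def missing_labels(node, path=()):
    """Return (path, language) pairs where a preferred term is missing."""
    name = next(iter(node.preferred.values()), "?")
    here = path + (name,)
    problems = [(here, lang) for lang in LANGUAGES if lang not in node.preferred]
    for child in node.narrower:
        problems.extend(missing_labels(child, here))
    return problems


bike = Keyword(preferred={"FR": "VTT", "EN": "mountain bike"})
vehicles = Keyword(
    preferred={"FR": "véhicules"},  # made-up node, English label missing on purpose
    synonyms={"FR": ["engins"]},    # a synonym may exist in one language only
    narrower=[bike],
)
problems = missing_labels(vehicles)
```

A check of this kind flags the places where a loan term or one of the ISO tips below is needed before the two languages share the same tree.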

To help build the same tree structure in different languages, the ISO 5964 standard gives some tips.

Exact equivalence

This is an ideal situation.

(figure: exact equivalence)

For instance:

VTT <-> mountain bike

Inexact equivalence

The 2 terms are not exactly equivalent. ISO recommends treating them as equivalent.

(figure: inexact equivalence)

Partial equivalence

One term is broader than its translation. ISO recommends treating them as equivalent.

(figure: partial equivalence)

Single to multiple equivalence

One term is equivalent to the sum of 2 or more terms in the other language.

(figure: single to multiple equivalence)

ISO recommends duplicating the first term, for instance:

Fuels <-> carburant + combustible

=> fuels <-> fuels
NT fuels [motors] <-> NT carburant
NT fuels [heating] <-> NT combustible

In the example, 'fuels' is loaned to French and split into carburant and combustible.

Different meaning

Sometimes one term has several meanings in another language, for instance:
The French word ‘TOUR’ means towers, lathes, tours and rotation. This is solved by duplicating TOUR as follows:

TOUR [bâtiment] <-> towers
TOUR [outil] <-> lathes
TOUR [voyage] <-> tours
TOUR [rotation] <-> rotation

References

ISO 5964 – Guidelines for the establishment and development of multilingual thesauri

In the next post, we will see how to use it in a web search.

15:21 Posted by Barbieri Sébastien in Technical | Tags: search, thesaurus

23-04-07

Full text search features

There are different kinds of full text search features.
The choice of a solution depends on the features you need.


Basic features

The basic features are:

  • WORD: find a word in a text
  • BOOLEAN: find ‘(a and b) or c’
  • PHRASE: find a sequence of words in the same order, e.g. find ‘hello world’; this is different from ‘hello’ and ‘world’
  • WILDCARD: search with truncation, like play*
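To make the four basic features concrete, here is a naive Python sketch. It scans the text directly instead of using an index, so it is not how a real full text engine works; the function names are hypothetical.

```python
import fnmatch
import re


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def has_word(text, word):
    """WORD: the word appears somewhere in the text."""
    return word.lower() in tokenize(text)


def has_phrase(text, phrase):
    """PHRASE: the words appear next to each other, in the same order."""
    words, target = tokenize(text), tokenize(phrase)
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))


def has_wildcard(text, pattern):
    """WILDCARD: at least one word matches a truncated pattern such as play*."""
    return any(fnmatch.fnmatchcase(w, pattern.lower()) for w in tokenize(text))


text = "Hello world, they were playing outside"
# BOOLEAN: combine the word tests with and / or, here '(hello and world) or tennis'.
boolean_hit = (has_word(text, "hello") and has_word(text, "world")) or has_word(text, "tennis")
```

The PHRASE test is stricter than the BOOLEAN one: ‘world hello’ contains both words but not the phrase ‘hello world’.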

Advanced features

 

An FTR engine can be very sophisticated, but even if a feature seems very interesting, the questions are:

  • does the customer understand it? Will they use it? e.g. NEAR
  • is it relevant for my application? e.g. ‘stemming’, a feature that lets you find all derived words, such as play – playing – played. This is of course not effective in a multilingual environment
  • what maintenance does it need? e.g. a ‘thesaurus’ is a very nice feature, but the knowledge (Brussels is a narrower term of Belgium, …) must be created and maintained, …

The advanced features are:

 

  • NEAR(/n): find words separated by at most n words.
    Probably too complicated for a standard web user
  • SENTENCE / PARAGRAPH: find words in the same sentence / paragraph.
    Probably too complicated for a standard web user
  • STEMMING: find a word and its derived forms, e.g. go -> going, went, …
    It only works in a single-language environment
  • TYPO: search that tolerates spelling errors, like elipse -> ellipse.
    Can be very interesting, as most people make typos
  • SOUND LIKE: find a word and the terms that sound like it, like ‘Gorbatchev’ - ‘Gorbatchov’
    It is language dependent, so it only works in a single-language environment
  • THESAURUS: a more sophisticated feature: find a word and its relations, synonyms, … e.g. Belgium -> Antwerp, Brussels, Nato, ECC, …
    Requires a knowledge database and a lot of maintenance; can be implemented in a multilingual environment
  • RELEVANCE RANKING: rank the hits by relevance (many matches in one text are more relevant than a few matches in another; the document size also matters)
    Maybe interesting
  • WEIGHT: related to relevance ranking, this is the possibility to give a weight to each word used in a search, e.g. Belgium 100% or Brussels 50%
    Maybe, but already complicated for a standard web user
  • MULTILANGUAGE: search in different languages, e.g. Belgium -> Belgique, België. This can be combined with weight
    Maybe, but does a user want to find a document in another language?
  • FILTER: be able to index documents such as Word, Excel, PowerPoint, PDF, …
    Probably
  • FUZZY: used to search in OCRed text. The OCR engine does not recognize all characters, so some words are incorrect. This kind of search will find something like ‘Skynet’ in a text containing 'Sk@net', where the ‘y’ was not recognized by the OCR engine
    Only interesting when you handle OCRed texts
  • STOP: a list of words to skip, like ‘a’, ‘the’, …
    Probably interesting to reduce the load on your search engine

Depending on your needs, you will have to choose a suitable product (Google, Topic, Oracle or SQL Server database, ...)

11:10 Posted by Melkebeke Thierry in Technical | Tags: thesaurus, search, multilanguage

18-04-07

How to Store User Generated Content (part 2)

Light Multimedia Storage

From a technical and architectural point of view

Images/video stored in databases as Binary Large Objects (BLOBs)

This is the ideal case when efficiency is not needed and data consistency is the only thing that matters.
With a database, you don't need to care about how many servers you have: since every server has access to the database, every server has access to the content.

Experience shows that this is ideal while the total size of the User Generated Content (UGC) stays below 10 GB. Beyond that limit you will start experiencing performance problems and backup problems => huge restoration times => huge downtime.

With a UGC size > 20 GB, access time and query speed will degrade dramatically. You may experience downtime, exhausted database connections, deadlocks, excessive concurrent access, excessive IO…
In measurable terms, 20 GB of content means: 200 users each uploading 10 pictures a day (1 MB in total) for 100 days.

Then you will need to redesign your solution.
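The 20 GB figure follows directly from those numbers, as this small Python calculation shows (using 1 GB = 1000 MB for simplicity):

```python
users = 200
mb_per_user_per_day = 1  # 10 pictures a day, 1 MB in total
days = 100

total_mb = users * mb_per_user_per_day * days  # 20000 MB
total_gb = total_mb / 1000                     # 20.0 GB
```

So a fairly small community of active uploaders reaches the limit in a little over three months.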

Another disadvantage of databases is that no real image/video file is directly available: every time an image/video is requested, you first have to read it from the database and then provide its temporary (or permanent) location to the user, who downloads it afterwards. This approach is known as caching.
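A minimal sketch of such a cache, assuming a hypothetical `images` table with an `id` and a `data` BLOB column. It uses SQLite from the Python standard library only to keep the example self-contained; the same idea applies to any database.

```python
import os
import sqlite3
import tempfile


def cached_path(conn, image_id, cache_dir):
    """Return a file path for the image; the BLOB is read only on a cache miss."""
    path = os.path.join(cache_dir, "%d.bin" % image_id)
    if not os.path.exists(path):
        row = conn.execute(
            "SELECT data FROM images WHERE id = ?", (image_id,)
        ).fetchone()
        if row is None:
            raise KeyError(image_id)
        with open(path, "wb") as handle:
            handle.write(row[0])
    return path


# Demo with an in-memory database and a temporary cache directory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO images (id, data) VALUES (?, ?)", (1, b"fake image bytes"))
cache_dir = tempfile.mkdtemp()

first = cached_path(conn, 1, cache_dir)   # reads the BLOB and writes the file
second = cached_path(conn, 1, cache_dir)  # served from disk, no database read
```

The sketch leaves out the hard part, which is removing or refreshing cached files when the database row changes or is deleted.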

Here are our graphs, based on our real results:

 

Fig. 1: while the db is small
 

 

 

Fig. 2: with a medium db

 

 

Fig. 3: the database reaches its maximum size (with acceptable performance)
 

 

Advantages / Disadvantages

Advantages:

  • Data consistency is maintained by the database
    • Easy to maintain
    • Implicit cleaning


Disadvantages:

  • If content is kept forever, the model is not a long-term design
  • A caching mechanism must be implemented
  • The time needed to access the content may increase drastically if the database has a lot of IO (big tables, intensive requests)

Usage

This kind of UGC storage is ideal when the number of images is kept small and they don't last long (bargains, news…): anything that is limited in time and will be removed.

See also

How to Store User Generated Content

17:28 Posted by Barbieri Sébastien in General | Tags: user generated content, ugc, lob, nas, light multimedia storage, database

26-03-07

Is Second Life close to reality?

The virtual community is growing fast. During the last quarter of 2006, the number of “residents” more than doubled. SL is now becoming a real phenomenon. Not only early adopters are talking about it; even people who don’t surf the web (yes, such people still exist) know about this virtual world.


There are currently more than 4.8 million residents living in SL. Of course, the majority of them aren’t active visitors, but nobody foresaw this result.


Crazier still is the fact that real money circulates there. More than €1 million is spent every day in SL. US dollars can be converted into Linden dollars at a rate of US$1 to L$277. As in real life, you can earn or spend money. You can become rich or poor.


Residents can buy or sell land, products and clothes, and also pay taxes. Real companies are also present there (Coca-Cola, Philips, ING, IBM, Nissan, etc.) in order to promote their brands and communicate about their product ranges.

Sooner or later they will test all their new products in SL before launching them in the real world. It will surely be more powerful and less risky to test residents’ reactions and perceptions on a large scale and across the different communities of SL. After all, the behavior of SL residents mirrors that of real people, meaning real consumers.


The success of this virtual world is also due to the fact that people are in control. They create or produce what they want: avatars, houses, rules, shops, games, movies, DVDs… there is almost no limit. They have full control over their own lives. OK, it’s virtual, but for some people it’s already better than reality.
When we say anybody can do anything, it’s really the case.


Former Virginia governor M. Warner, when he intended to run for president, made Second Life's first political whistle-stop as an avatar of himself. Duran Duran and Suzanne Vega performed live concerts, playing into microphones so that the music was transmitted through their avatars.
A few days ago Sweden announced it will be the first country to open a virtual embassy in the game. Online visitors at the "embassy" can learn how to apply for a visa or find some general information, which Sweden hopes will help encourage real-world tourists.


Soon we’ll maybe see the creation of a Second Life NFL with the same existing teams, playing the same matches at the same time, with the same sponsors, but maybe with different results. We can also imagine creating a Second Life Tennis Association with the same players taking part in the same Grand Slam and Masters Series events, with the difference that Roger Federer won’t win almost every tournament.


SL is part of the Internet hype: strong growth and an increasing number of residents, which means a huge amount of personal data. As often in such cases, success attracts hackers. SL was hacked a few months ago, and the data of more than 650,000 users were stolen. The hackers reached the main database through the company's web servers. As a consequence, the passwords of all residents were changed. Nothing dramatic: just a lot of press coverage and a faster increase in residents.


The technology used for this platform is not that impressive. Although the GUI is poor and certainly not as good as that of some online games, the code is evolving thanks to its users. Part of the source code is available so that users can modify it and plug software into the site, potentially accelerating Second Life's development.


So, is Second Life close to the real world? Well… of course it’s not that easy to change your life in 5 clicks, and you cannot get everything you want so easily. But the way people try to interact, organize their lives and set up rules, and the way companies communicate with the residents, is not that far from the real world.

Links:

 

  1. http://www.secondlife.com/
  2. http://www.techcrunch.com/2006/09/08/metaverse-breached-s...
  3. http://www.thelocal.se/6219/20070126/

12:18 Posted by Massin Fabrice in General | Tags: virtual community

The end of the 30 sec spot?

Most TV viewers hate advertising (the so-called “30-second spot”). The soccer fan doesn’t like being forced to watch a commercial (even when it’s about beer) when his team is dominating the game, the teenager grumbles when his daily soap is cut twice in 30 minutes, and the mother doesn’t understand why her drama series is always interrupted in the middle of an important scene.

Of course advertisers know this. But they also know that people don’t want to miss part of their favorite TV program, and that’s why they choose to impose the 30-second spot. We know they know and they know we know, but they don’t care about us. They just think about the money they receive.


So when the Digital Video Recorder (DVR) came on the market at a reasonable price, we, the “victims”, thought it was the end of ads. We would never again be obliged to watch something we don’t want, when we don’t want. Total freedom…


The advertisers tried to fight against it, but they quickly understood they would never win this battle. This time they have to care about the viewer in order to keep a positive ROI. Because we are able to skip the spot, they are obliged to make it more interesting or funnier, and sometimes even tailored to our needs.


New and creative solutions are appearing, designed specifically for TV/DVR usage. The purpose is to capture the viewer’s interest by giving him the choice to watch more, and in a different way. The advertisement is less impersonal and brings a new experience to the viewer.


The deal between TiVo and Lexus is a good example of this kind of solution.

TiVo customers can watch information about the Lexus, locate the nearest dealer and schedule a test drive, but they can also virtually customize a car to their needs. And all this only if they want to.


The relationship between viewers and advertisers is certainly changing.

 

Links:

  1. http://www.tivo.com/cms_static/press_131.html
  2. http://www.tivo.com/1.2.asp

 

12:17 Posted by Massin Fabrice in General | Tags: dvr, tv spot, tivo