The Semantic Information Technology
The purpose of the Semantic Information Technology (SIT) is:
to provide the transition from formalistic, syntax-based computer informatics to informatics that works with the sense of the information, i.e. semantic informatics. Without processing the sense, IT does not process information, only data.
to change text (including web pages) from a passive and detached sequence of characters into a system of active concepts and objects (semantic units) included in a global semantic system, and to wash away the spirit of paper that still reigns in IT.
to transform the content of database tables from strings and numbers into identifiers of semantic units, because the symbol of today's IT is still an impersonal column of numbers.
to transform the source code and every executable copy of software products from a system of formal, isolated instructions and groups of instructions into a system of semantic units included in the Global Semantic Information System (GSIS). Programming languages (C, Java, etc.) are languages only in a conditional sense. Compared with true languages like Bulgarian or English, C and Java are extremely primitive sets of instructions, data types and functions, with largely context-free grammars.
to add a semantic layer to every picture, sound recording and video recording, and to include it in the GSIS.
to create a semantic operating system, i.e. a system that includes semantic units like computer, memory, hard disk, file, etc.
In summary, the purpose of SIT is to create a model of reality (which is definitely human reality) that is as close to the truth as possible, a model that includes the physical (body, objects, energies), psychological (sensations, emotions, needs, values) and mental (images, concepts, language) aspects of reality. Whereas CYC builds a model of one generic and abstract human, SIT builds models of thousands of generic people and billions of individual people.
SIT is not based on an abstract and formal mathematical theory. It is based on the principles of intelligence. It is a radical break with sequential execution, loops, branches, variables, data types, functions, and partially with files (in a semantic operating system the break is complete).
SIT stores and processes information in a way that is many times closer to reality and truer to life than older technologies. SIT is closer to life because the human world is built not of data but of a huge number and diversity of structured units, which connect and interact in thousands of ways.
The main technical aspect of the new technology is that things are represented not as conventional files or records, but as simple and universal elements that bear sense, or intellectual units (semantic units), which on the one hand are separate, but on the other hand can join together to make new elements of sense. Like atoms, these elements form whole information universes. The analogues of these elements are the words of a human language (more precisely, the structures in our brains that the words denote). In language, joining verbs, nouns, adjectives and other bearers of sense creates new bearers of sense: sentences and thoughts. In a semantic system every semantic element acquires its sense from its relations and interactions with other semantic units.
More technically speaking, a semantic unit is an index (a list) of all contexts and situations in which it participates; of all actions that it performs; of all its parts, attributes and substances; of all opinions, statements and attitudes about it; of all its states, forms and roles; and of all words, names and expressions that denote it. An individual semantic unit can have 2 attributes (relations) or 20,000, without complicating the structure and without seriously slowing down search and processing.
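The idea of a semantic unit as an index of its relations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name SemanticUnit, the relation kinds ("part", "attribute", "denoted_by") and the identifiers are hypothetical and do not come from any SIT implementation.

```python
from collections import defaultdict


class SemanticUnit:
    """Illustrative semantic unit: an index of relations to other units."""

    def __init__(self, uid):
        self.uid = uid
        # relation kind -> set of identifiers of related units
        self.relations = defaultdict(set)

    def relate(self, kind, other):
        """Record that this unit has a relation of the given kind to `other`."""
        self.relations[kind].add(other.uid)

    def related(self, kind):
        """Return identifiers of all units related by `kind` (empty set if none)."""
        return self.relations.get(kind, set())


# Hypothetical units and relation kinds, chosen only for the example.
table = SemanticUnit("unit:table")
leg = SemanticUnit("unit:table-leg")
flat = SemanticUnit("unit:flat")
word = SemanticUnit("word:en:table")

table.relate("part", leg)
table.relate("attribute", flat)
table.relate("denoted_by", word)
```

Because each relation kind maps to a set, a unit with 2 relations and a unit with 20,000 relations have the same structure, and looking up one kind of relation does not require scanning the others.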
SIT needs only a small amount of traditionally coded software to run, because all of the system's functionality is coded as interacting semantic units (which, for speed, could be translated into machine code in the background). Also, future information systems are developed not by writing new software versions, but by enriching existing systems with new semantic elements and relations.
All information entities are subject to semantic transformation. Even "simple" letters and numbers cease to be computer memory cells and turn into semantic units, structured like all other semantic units and having many relations to other semantic units. In SIT the meaning is contained entirely in the system of semantic objects and their relations. It is not dispersed across program code, files and other places, as it is in conventional information technologies.
Today every programmer or database architect invents his or her own 'words' (identifiers) and 'concepts' (data structures) for every new project. This situation does not facilitate the exchange of experience or the accumulation of knowledge. Knowledge can be accumulated only if there is a stable, long-lasting and widely accepted concept (a semantic unit) around which to accumulate it. SIT overcomes the present technological lack of communicativeness, because by definition it includes potentially all words and word groups from potentially all human languages, together with the semantic units that they denote.
The great importance of semantic units in SIT reflects the great importance of whole entities in the human world.
Instead of rules and predicates, SIT uses abstract and generic semantic units together with individual semantic units in one system.
What is very important in the semantic technology is that semantic elements are the result of long-term accumulation. Hence, instead of creating new and separate semantic elements, we make relations to already existing, common, universal semantic elements. An example: every company has employees. If for every newly hired employee you enter his information from paper documents or from incompatible file formats, you spend considerable extra time and work. Furthermore, the amount of personnel information to enter and process increases continuously. If you obtain the global semantic unit for that employee from the Internet and use it, you avoid re-entering information that already exists and, more importantly, you have access to a potentially unlimited source of information about that person. Last but not least, if all people participate in this semantic agreement, then society has an information system far more perfect than today's. Hence, informatics grows around semantic objects as one system, whereas today informatics grows around software. Present software and information systems are in principle local and short-term phenomena. The semantic information system, by contrast, is global: it is one, common and universal. This is a revolution, not because I'm a megalomaniac or because I have a whim, but simply because progress is unavoidable. The semantic information system is the absolute apogee of all information systems. It corresponds to the advancing global economy. Furthermore, the most important precondition, the Internet, is already here.
This is one of the main aspects of SIT: the reuse of large amounts of information in the form of already prepared elements that have global, eternal, unique identifiers. Such elements include people, goods, companies, populated places, countries, real estate, laws, streets, languages, government services, professions, product designs, wars, purchases, weddings, travels, and many, many others.
At heart, SIT supports identity. In the whole semantic system there is only one semantic element representing a specific number (for example the number 7). Wherever this number is used, we have a reference to that single semantic element instead of a new and separate semantic element for the number. If we had two elements instead of one, it could happen that the first has the name 'sedem' (in Bulgarian) while the second has the name 'seven' (in English). Then, instead of the number 7 having a known name in two languages, it would sometimes be only 'sedem' and sometimes only 'seven'. To borrow a term loosely from psychology, this state is a sickness: a kind of schizophrenia.
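The single-element principle described above can be illustrated with a small registry that always returns the same object for the same key. This is a minimal sketch, not SIT's actual mechanism: the Registry class, the key "number:7" and the dictionary layout are assumptions made for the example.

```python
class Registry:
    """Illustrative registry that keeps exactly one unit per global key."""

    def __init__(self):
        self._units = {}

    def get(self, key):
        """Return the one unit for `key`, creating it on first use."""
        if key not in self._units:
            self._units[key] = {"key": key, "names": {}}
        return self._units[key]


registry = Registry()

# Two different places in the system ask for the number 7
# (the key "number:7" is a hypothetical identifier).
seven_a = registry.get("number:7")
seven_a["names"]["bg"] = "sedem"

seven_b = registry.get("number:7")
seven_b["names"]["en"] = "seven"
```

Both references point at the same unit, so the Bulgarian and the English name accumulate in one place instead of being split between two unrelated elements.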
Contexts have great importance in SIT. 'Semantic' means context-aware. A context is any whole, scene or situation based on physical, temporal or mental relation and closeness, or constructed by logical relations. A context is any semantic unit or set of semantic units that is not an indivisible atom, that has parts, members, contents or relations, and to which the preposition "in" can be applied.
SIT allows as many definitions of every thing and phenomenon as there are contexts. There can be one definition of a company in the context of a person, a second in the context of a state, a third in the context of a culture, a fourth in the context of a speech, etc. (By 'definition' I mean one or another set of relations of the semantic unit.)
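Taking 'definition' in the sense just given (a set of relations of a unit), context-dependent definitions can be sketched as a lookup keyed by unit and context. The relation names and the two sample definitions below are invented for illustration only.

```python
# (unit, context) -> set of (relation, target) pairs.
# All entries are hypothetical examples, not data from SIT.
DEFINITIONS = {
    ("company", "person"): {("employs", "person"), ("pays", "salary")},
    ("company", "state"): {("pays", "taxes"), ("registered_in", "state")},
}


def define(unit, context):
    """Return the set of relations that defines `unit` within `context`."""
    return DEFINITIONS.get((unit, context), set())


in_person_context = define("company", "person")
in_state_context = define("company", "state")
```

The same unit, 'company', yields a different set of relations in each context, and a context with no recorded definition simply yields an empty set.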
SIT includes all human languages. Every word is a full-fledged semantic unit and can participate in millions of connections to other semantic units.
SIT has all the flexibility, richness and elegance of human languages. The contrast with the sketchiness and narrowness of formal-procedural computer languages is remarkable. By combining semantic units, you can express many more real-world objects with only a few units, and can say something new using only known units. By combining semantic units, there are many ways to express one and the same thought. Like human languages, SIT uses generalizations and organizes concepts hierarchically in one system. Like human languages, SIT directly denotes things that exist in real life: cities, people, subjective states and experiences. There is nothing like this in computer pseudo-languages. Like human languages, SIT can work with different parts and aspects of things (i.e. abstraction, which allows experience to be transferred from a known element to an unknown one, provided that the new element has enough characteristics in common with the old; this removes the need to describe every new element in detail). Like the Bulgarian language, SIT uses gradations, for example: 'never' - 'rarely' - 'normal' - 'often' - 'always'. Like the Bulgarian language, SIT uses relativity: (a) to the norm (with sign '+' or '-'), for example 'many', 'little'; (b) to a point, for example 'after two hours'; (c) to a quantity of the same kind, for example 'more than', 'quicker than'; (d) to a part, for example 'front part', 'back part'; (e) to space and time, for example 'before', 'above', 'after'. Like the English language, SIT uses metaphors. SIT uses collectives, for example 'flock', 'Varna citizens'. SIT uses different degrees of definiteness, approximation and inexactness, for example 'here', 'there', 'about 500'. SIT refers to something or someone by a particular characteristic or aspect, for example 'you, fat man', the same way the Romanian language does. SIT uses relations of analogy and similarity, for example 'he is like a dog'.
SIT concentrates a huge amount of meaning in one semantic unit. As in a human language, if you say of someone that he is 'good', you have said the most important thing, and details are unnecessary; the semantic unit 'freedom' contains more value than the semantic unit 'money', no matter how mercantile our days are.
An integral part of SIT is its ability to translate from one language to another. This is not a separate module linked in by a few functions, as in conventional software. The ability results from the organization of semantic units in SIT. All semantic units are denoted by natural-language words or combinations of words from many languages, so translation is a flow of activity from the incoming words to semantic units and then to words (and grammar rules) in the other language.
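The flow from words to semantic units and back to words can be sketched as two lookups through a shared unit identifier. This toy example works word by word and leaves out the grammar rules mentioned above; the vocabulary (with Bulgarian words transliterated), the unit identifiers and the function name translate are all assumptions for illustration.

```python
# (language, word) -> unit identifier. Hypothetical sample vocabulary.
WORD_TO_UNIT = {
    ("bg", "sedem"): "number:7",
    ("en", "seven"): "number:7",
    ("bg", "masa"): "unit:table",
    ("en", "table"): "unit:table",
}

# Reverse index: (unit identifier, language) -> word.
UNIT_TO_WORD = {}
for (lang, word), unit in WORD_TO_UNIT.items():
    UNIT_TO_WORD[(unit, lang)] = word


def translate(words, source, target):
    """Map each word to its semantic unit, then to a word in the target language.

    Words with no known unit or no target-language name are passed through unchanged.
    """
    result = []
    for w in words:
        unit = WORD_TO_UNIT.get((source, w))
        result.append(UNIT_TO_WORD.get((unit, target), w))
    return result


english = translate(["sedem", "masa"], "bg", "en")
```

There is no Bulgarian-to-English table anywhere in the sketch: adding a third language only requires attaching its words to the existing units.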
An important feature of SIT is partial recognition, i.e. recognition of incomplete and erroneous information. This is especially important in text and image recognition, where errors are unavoidable. To recognize a word with a wrong letter, a missing letter or swapped letters, the search is performed not by words but by letters. A word is recognized if it has the largest share of recognized letters among all candidate words. Recognition of all other elements (word groups, sentences, photographs, audio and video recordings) is organised similarly.
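Letter-based recognition of a damaged word can be sketched as follows. The scoring rule used here (number of shared letters, counted with multiplicity, divided by the longer of the two lengths) is one plausible reading of "the largest share of recognized letters", not a rule stated in the text; the function name recognise and the sample vocabulary are hypothetical. Letter order is deliberately ignored, which is a simplification.

```python
from collections import Counter


def recognise(garbled, vocabulary):
    """Return the vocabulary word sharing the largest fraction of letters with `garbled`.

    Assumes `vocabulary` is a non-empty list of non-empty words.
    """
    target = Counter(garbled)

    def score(word):
        # Letters common to both words, counted with multiplicity.
        shared = sum((Counter(word) & target).values())
        return shared / max(len(word), len(garbled))

    return max(vocabulary, key=score)


vocabulary = ["semantic", "syntax", "system"]  # hypothetical sample vocabulary

swapped = recognise("semnatic", vocabulary)   # two letters swapped
missing = recognise("semntic", vocabulary)    # one letter missing
wrong = recognise("semantix", vocabulary)     # one wrong letter
```

In all three cases the word 'semantic' scores highest. A real system would also need to take letter positions into account, since words that are anagrams of each other score identically under this rule.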
A system based on formal logic, like CYC, cannot give a good description of the statement "the table is flat", whereas SIT can, because it describes things through the states of a human who is watching, touching, moving around, thinking and speaking about the flatness of the table. That is, flatness is a state of the visual, tactile, motor, emotional, associative and verbal parts of the human brain, appropriately modelled as a system of semantic units.
SIT is based on the natural, historical state of affairs, not on abstract and formal ideas and theories extracted from context, experience and development. Recognizing what is normal and what is not is very important in SIT. SIT brings things back to their context and returns the individuality that conventional IT has taken away from them.
While in traditional information technologies numbers are just numbers (2, 56, 3922, etc.), in the semantic technology numbers are mostly not bare numbers but weights, lengths, counts, sums and other quantitative semantic elements.
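The difference between a bare number and a quantitative semantic element can be illustrated with a value that carries its kind and unit of measurement. The Quantity class below is a hypothetical sketch, and the kinds and units are sample values; it only shows that such a value can refuse meaningless operations that bare numbers allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Illustrative quantity: a number together with what it measures."""

    value: float
    kind: str  # e.g. "weight", "length", "count"
    unit: str  # e.g. "kg", "m", "pieces"

    def __add__(self, other):
        if not isinstance(other, Quantity):
            return NotImplemented
        if (self.kind, self.unit) != (other.kind, other.unit):
            raise TypeError(f"cannot add {other.kind} in {other.unit} "
                            f"to {self.kind} in {self.unit}")
        return Quantity(self.value + other.value, self.kind, self.unit)


total_weight = Quantity(2, "weight", "kg") + Quantity(3, "weight", "kg")

try:
    Quantity(2, "weight", "kg") + Quantity(56, "length", "m")
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```

With bare numbers, 2 + 56 is always 58; here, adding a weight to a length is rejected because the two values are different kinds of thing.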
The only semantic units related to data of the primitive type 'string' are those that have a literal representation in real life too: words, names, codes, numbers, digits, signs, etc.
An important aspect of the new technology is its dependency on size and diversity. To work well and to show the advantages of the semantic technology, a semantic system needs to reach a certain critical mass. The bigger and more diverse it is, the better it works (other factors aside).
Why does business need SIT? Today merging two enterprises is a usual task, while merging their information systems is an unusually difficult one, even if their database management systems come from the same manufacturer (say Microsoft's SQL Server). Tables with identical functionality have different names; columns with identical functionality have different names; the types of columns with identical functionality differ, or at least the field lengths differ; or the number of columns in a table is different. If by chance the names, numbers, lengths, etc. coincide, the data will certainly not match: in one table my name is 'G. Dimitrov', in another 'Georgi Dimitrov', in a third 'G Dimitrov'. In all three cases the names are totally different according to today's multi-billion software products, which can hardly be called high technology.
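The name mismatch in this example can be made concrete. In the sketch below, the three spellings are compared first as plain strings and then through a name index that maps each spelling to one semantic unit. The identifier "person:georgi-dimitrov" and the index itself are hypothetical; in SIT the names would be relations of the unit rather than a separate table.

```python
PERSON = "person:georgi-dimitrov"  # hypothetical global identifier

# Every spelling that denotes the person points at the same unit.
NAME_INDEX = {
    "G. Dimitrov": PERSON,
    "Georgi Dimitrov": PERSON,
    "G Dimitrov": PERSON,
}


def same_entity(name_a, name_b):
    """True if both names are known and denote the same semantic unit."""
    unit_a = NAME_INDEX.get(name_a)
    unit_b = NAME_INDEX.get(name_b)
    return unit_a is not None and unit_a == unit_b


# Conventional comparison: the strings differ, so the records do not match.
strings_match = "G. Dimitrov" == "Georgi Dimitrov"

# Semantic comparison: both names resolve to one unit.
units_match = same_entity("G. Dimitrov", "Georgi Dimitrov")
```

When the tables of two merged enterprises store the identifier of the unit instead of a spelling of the name, the rows join directly and the choice of spelling becomes a matter of display.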
The main purpose of science is to create a proven, widely accepted and common system of knowledge. SIT offers scientists the perfect language in which to conduct and write up their work, a language much better than mathematics, because everything written immediately becomes part of a working model: the Global Semantic Information System.
Main sources of information: more than 400,000 categories from the Open Directory Project; more than 109,000 synsets from Princeton's WordNet 1.7; the Upper Cyc Ontology from Cycorp; more than 15,000 categories from the United Nations Standard Products and Services Code (UNSPSC); more than 350 MB of texts from Encyclopaedia Britannica; the CIA World Factbook; many tens of thousands of biographies from different sources; catalogues of commodities; large Bulgarian-English, English-Bulgarian and other bilingual dictionaries; large dictionaries of English, Bulgarian and other languages; information about tens of thousands of enterprises in Bulgaria and other countries; many ontologies from DAML; Standard Industrial Classification (SIC) codes; the North American Industry Classification System; classifications of diseases, medicines, sciences, plants and animals, chemical compounds and others; thousands of human names, human languages, units of measurement, cities, streets, motion pictures, universities and colleges, professions, software products, programming languages and many others; the texts of hundreds of books; the texts of Bulgarian legislation; a vast number of other dictionaries, encyclopaedias, reference books and web pages.
Resources needed: people who know different human languages well (in the first stage, the official European languages plus Bulgarian); specialists experienced in natural language processing, knowledge engineering, data representation and processing, and computer programming; high-speed Internet access; a high-traffic web server; computer hardware and software; information from a great diversity of sources.
Goals, in concrete terms: To ensure SIT's market dominance by building and disseminating a huge base of semantic information (the first build of the Global Semantic Information System) comprising at least 50 million semantic units from every area of knowledge; each belonging to one or more of at least 350,000 generic notions; having at least 500 million relations; every semantic unit having relations to words, phrases and names in all of the official European languages; every global semantic unit having a global eternal semantic identifier. In particular, words and idioms of the official EU languages: no fewer than 1,200,000 basic words and idioms. At least 5,000 different types of people, such as: Bulgarian, Englishman, teacher, citizen, Varna citizen, sprinter, etc. The semantic information will be disseminated by issuing "encyclopaedias" on compact discs and by on-line publishing (incl. the catalogue of semantic units, the "Semantic Yahoo"). To achieve natural language understanding, translation and interaction. To build an operating system written as interrelated and interacting semantic units (using the source code of Linux). To build the semantic server and client as semantic units.
Measured against the evaluation criteria: The project has a large scientific and technological impact. This research opens new prospects for IST. It will have a large economic impact and contribute to solving societal problems. The potential long-term benefits are sufficiently large to justify the level of risk of the project. The objectives are challenging and clearly defined. They represent clear progress well beyond the current state of the art. The research is highly innovative. The proposed S&T approach is well thought out. It enables the project to achieve its objectives.